AI Website Content Markdown Scraper

by quaking_pail

13,642 runs
840 users

About AI Website Content Markdown Scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

What does this actor do?

AI Website Content Markdown Scraper is a web scraping tool that runs on the Apify platform. It crawls the websites you specify, cleans the main content of each page, converts it to Markdown, and stores the result in a dataset.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation
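As a rough illustration of the API access mentioned above, the sketch below prepares (but does not send) an HTTP request that would start a run through Apify's REST API. The Actor ID and token are placeholders, not real values, and the helper name build_run_request is hypothetical. The input fields are the ones listed in the Documentation section.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    actor_id="username~actor-name",  # placeholder, not this Actor's real ID
    token="YOUR_APIFY_TOKEN",        # placeholder API token
    actor_input={
        "start_urls": [{"url": "https://apify.com"}],
        "max_depth": 1,
        "max_urls": 10,
        "search_engine": "Google",
    },
)
# To actually start the run, pass the request to urllib.request.urlopen(request).
```

Building the request separately from sending it keeps the example free of network calls. In a real integration, Apify's official client libraries may be more convenient.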

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

Apify Actor: Markdown Website Crawler

Overview

This Apify Actor crawls a website starting from a list of given URLs, performs a search using a selected search engine to find more relevant URLs within the same domain, scrapes and cleans the main content of the pages, and outputs the result in Markdown format.

It uses Selenium with a headless Chrome browser to accurately render JavaScript-heavy websites and extract readable content. Unwanted scripts, ads, headers, footers, and cookie banners are removed to ensure clean and focused output.

Input Schema

The Actor accepts the following input fields:

  • start_urls (Array): Array of objects with a url key. These are the starting points of the crawl.
  • max_depth (Integer): Maximum crawl depth (how far it should follow links from the start page).
  • max_urls (Integer): Maximum number of pages to scrape in total.
  • search_engine (String, optional): Which search engine to use to find additional URLs. One of: Google, Bing, or DuckDuckGo. Default: Google.

Example input (JSON):

    {
      "start_urls": [
        { "url": "https://apify.com" }
      ],
      "max_depth": 1,
      "max_urls": 10,
      "search_engine": "Google"
    }

Output Format

Each result pushed to the dataset contains:

  • url (String): The URL of the scraped page.
  • title (String): The page's title (as seen in the browser tab).
  • content (String): The cleaned Markdown version of the main page content.

Functionality

1. Search Engine Discovery

  • Uses Google, Bing, or DuckDuckGo to search for the domain.
  • Extracts links that belong to the same root domain.
  • Adds those links to the crawl queue.

2. Crawling & Scraping

  • Opens each valid page.
  • Strips unwanted elements: scripts, headers, footers, styles, iframes, videos, cookie banners.
  • Extracts main, article, section, and div content.
  • Converts the HTML to Markdown using markdownify.

3. Cleaning Markdown

  • Removes broken or irrelevant Markdown syntax.
  • Filters out image tags, inline SVGs, tracking text, and known cookie policy messages.
  • Trims and normalizes white space.

Limitations

  • The scraper is designed to stay within the same root domain as the starting URL.
  • Heavy JavaScript pages may still fail if they block bots or detect automation.
  • Search engine interaction is subject to changes in their HTML structure and may break over time.

Development Notes

  • Browser automation is powered by Selenium and ChromeDriver.
  • Designed for use in Apify's headless actor environment with Chromium.
  • Requests are tracked using Apify's RequestQueue with deduplication.

Cleanup

  • The browser (driver.quit()) is gracefully closed at the end.
  • Requests are marked as handled after processing.

Usage

This Actor is ideal for:

  • Archiving or monitoring content changes.
  • SEO content extraction.
  • Research on company websites or competitor analysis.
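The crawl rules described in the documentation above (same root domain, depth limit, page limit, deduplicated queue) and the whitespace and image cleanup can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Actor's actual code. The names root_domain, crawl, and clean_markdown are hypothetical. get_links stands in for the Selenium-rendered page. The root-domain rule is a naive "last two hostname labels" guess that would mishandle suffixes such as .co.uk. The HTML-to-Markdown conversion and the search engine step are left out.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def root_domain(url: str) -> str:
    """Naive root domain: the last two labels of the hostname (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def crawl(start_urls, get_links, max_depth=1, max_urls=10):
    """Breadth-first crawl that stays on the root domains of the start URLs.

    get_links(url) is a hypothetical callback returning the hrefs found on a
    page; it replaces real browser automation so the sketch runs offline.
    """
    allowed = {root_domain(item["url"]) for item in start_urls}
    queue = deque((item["url"], 0) for item in start_urls)
    seen = {url for url, _ in queue}  # deduplication set
    visited = []
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for href in get_links(url):
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if root_domain(link) in allowed and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def clean_markdown(text: str) -> str:
    """Drop Markdown image tags and normalize whitespace."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)  # remove image tags
    text = re.sub(r"[ \t]+\n", "\n", text)            # strip trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)            # collapse blank-line runs
    return text.strip()
```

Passing the link extractor in as a callback keeps the queue logic separate from the browser, so the traversal rules can be checked without launching Chrome.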

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try AI Website Content Markdown Scraper now on Apify. Free tier available with no credit card required.

Actor Information

Developer: quaking_pail
Pricing: Paid
Total Runs: 13,642
Active Users: 840

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
