AI Website Content Markdown Scraper
by quaking_pail
About AI Website Content Markdown Scraper
This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.
What does this actor do?
AI Website Content Markdown Scraper is a web scraping tool that runs on the Apify platform. It crawls the websites you specify, strips scripts, ads, headers, footers, and cookie banners, and stores the main content of each page as Markdown in a dataset.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
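As a rough illustration of the "configure the input parameters" step, the sketch below builds an input payload using the field names listed in the Documentation section (start_urls, max_depth, max_urls, search_engine). The helper name build_input and its validation rules are assumptions made for this example; they are not part of the Actor.

```python
import json

# Search engines named in the Actor's documentation.
ALLOWED_ENGINES = {"Google", "Bing", "DuckDuckGo"}


def build_input(urls, max_depth=1, max_urls=10, search_engine="Google"):
    """Build an input payload with the documented field names.

    The checks below are illustrative assumptions, not the Actor's own rules.
    """
    if not urls:
        raise ValueError("at least one start URL is required")
    if max_depth < 0:
        raise ValueError("max_depth must be zero or greater")
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    if search_engine not in ALLOWED_ENGINES:
        raise ValueError(f"unsupported search engine: {search_engine!r}")
    return {
        # The documentation describes start_urls as objects with a "url" key.
        "start_urls": [{"url": url} for url in urls],
        "max_depth": max_depth,
        "max_urls": max_urls,
        "search_engine": search_engine,
    }


def to_json(payload):
    """Serialize the payload so it can be pasted into the Actor's input editor."""
    return json.dumps(payload, indent=2)
```

Calling to_json(build_input(["https://apify.com"])) yields JSON in the same shape as the example input shown in the documentation.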
Documentation
Apify Actor: Markdown Website Crawler
Overview
This Apify Actor crawls a website starting from a list of given URLs, performs a search using a selected search engine to find more relevant URLs within the same domain, scrapes and cleans the main content of the pages, and outputs the result in Markdown format.
It uses Selenium with a headless Chrome browser to accurately render JavaScript-heavy websites and extract readable content. Unwanted scripts, ads, headers, footers, and cookie banners are removed to ensure clean and focused output.
Input Schema
The Actor accepts the following input fields:
- start_urls (Array): Array of objects with a url key. These are the starting points of the crawl.
- max_depth (Integer): Maximum crawl depth (how far it should follow links from the start page).
- max_urls (Integer): Maximum number of pages to scrape in total.
- search_engine (String, optional): Which search engine to use to find additional URLs. One of: Google, Bing, or DuckDuckGo. Default: Google.
Example input (JSON):
    {
      "start_urls": [
        { "url": "https://apify.com" }
      ],
      "max_depth": 1,
      "max_urls": 10,
      "search_engine": "Google"
    }
Output Format
Each result pushed to the dataset contains:
- url (String): The URL of the scraped page.
- title (String): The page's title (as seen in the browser tab).
- content (String): The cleaned Markdown version of the main page content.
Functionality
1. Search Engine Discovery
- Uses Google, Bing, or DuckDuckGo to search for the domain.
- Extracts links that belong to the same root domain.
- Adds those links to the crawl queue.
2. Crawling & Scraping
- Opens each valid page.
- Strips unwanted elements: scripts, headers, footers, styles, iframes, videos, cookie banners.
- Extracts main, article, section, and div content.
- Converts the HTML to Markdown using markdownify.
3. Cleaning Markdown
- Removes broken or irrelevant Markdown syntax.
- Filters out image tags, inline SVGs, tracking text, and known cookie policy messages.
- Trims and normalizes white space.
Limitations
- The scraper is designed to stay within the same root domain as the starting URL.
- Heavy JavaScript pages may still fail if they block bots or detect automation.
- Search engine interaction is subject to changes in their HTML structure and may break over time.
Development Notes
- Browser automation is powered by Selenium and ChromeDriver.
- Designed for use in Apify's headless actor environment with Chromium.
- Requests are tracked using Apify's RequestQueue with deduplication.
Cleanup
- The browser (driver.quit()) is gracefully closed at the end.
- Requests are marked as handled after processing.
Usage
This Actor is ideal for:
- Archiving or monitoring content changes.
- SEO content extraction.
- Research on company websites or competitor analysis.
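Two of the steps described in the documentation above, keeping only links on the same root domain and cleaning the converted Markdown, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the Actor's actual code: the function names, the regular expressions, and the cookie phrases are all hypothetical, and the root-domain check naively compares the last two host labels, so it mishandles suffixes such as co.uk.

```python
import re
from urllib.parse import urlparse


def root_domain(url):
    """Return the last two labels of the host (naive; ignores public suffixes)."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def same_root_domain(url, start_url):
    """True when both URLs share the same (naively computed) root domain."""
    domain = root_domain(url)
    return bool(domain) and domain == root_domain(start_url)


# Illustrative patterns only; the Actor's real filters are not documented.
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
SVG_RE = re.compile(r"<svg\b.*?</svg>", re.IGNORECASE | re.DOTALL)
COOKIE_RE = re.compile(
    r"^.*\b(?:we use cookies|cookie policy)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def clean_markdown(text):
    """Drop images, inline SVGs, and cookie notices, then normalize whitespace."""
    text = SVG_RE.sub("", text)
    text = IMAGE_RE.sub("", text)
    text = COOKIE_RE.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()
```

Note that collapsing all runs of spaces also flattens indentation, so a cleaner that needs to preserve nested lists or indented code in the Markdown would have to treat leading whitespace separately.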
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try AI Website Content Markdown Scraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: quaking_pail
- Pricing: Paid
- Total Runs: 13,642
- Active Users: 840