Universal Markdown Scraper for LLMs
by botflowtech
About Universal Markdown Scraper for LLMs
What does this actor do?
Universal Markdown Scraper for LLMs is a web scraping tool available on the Apify platform. It runs in the cloud and converts web pages into clean Markdown suitable for Large Language Model input.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# Universal Markdown Scraper for LLMs

Transform any webpage into clean, token-efficient Markdown optimized for ChatGPT, Claude, and other Large Language Models. This Actor automatically removes ads, navigation bars, footers, and other noise that wastes valuable API tokens.

## Why Use This Actor?

Most web scrapers return messy JSON or raw HTML that's unsuitable for LLM context windows. This Actor solves that problem by:

- Extracting only main content using Mozilla's Readability algorithm
- Converting to clean Markdown format that LLMs process efficiently
- Removing token-wasting elements like ads, sidebars, cookie banners, and navigation
- Providing token estimates so you know roughly how much context you're using
- Processing at scale with concurrent URL handling

Perfect for AI developers building RAG systems, chatbots, research assistants, and content analysis tools.

## Use Cases

- RAG (Retrieval Augmented Generation): Extract clean content for vector databases and knowledge bases
- AI Research Assistants: Feed web articles directly into ChatGPT/Claude for analysis
- Content Summarization: Get article text without noise for LLM-powered summarizers
- Documentation Processing: Convert technical docs to Markdown for AI-powered Q&A systems
- News Monitoring: Extract clean article content for sentiment analysis and topic modeling
- Training Data Preparation: Collect high-quality text data for fine-tuning LLMs

## Input

The Actor accepts the following input parameters:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `startUrls` | Array | Yes | List of URLs to scrape. Format: `[{ "url": "https://example.com" }]` |
| `maxConcurrency` | Integer | No | Number of pages to process simultaneously (1-50, default: 10) |
| `removeImages` | Boolean | No | Strip all images to save tokens (default: false) |
| `removeLinks` | Boolean | No | Convert hyperlinks to plain text to save tokens (default: false) |

### Example Input

```json
{
  "startUrls": [
    { "url": "https://apify.com/about" },
    { "url": "https://openai.com/research" },
    { "url": "https://www.anthropic.com/news" }
  ],
  "maxConcurrency": 5,
  "removeImages": false,
  "removeLinks": false
}
```

## Output

The Actor outputs clean Markdown with metadata for each URL processed. Results are stored in the default dataset.

### Example Output

```json
{
  "url": "https://apify.com/about",
  "title": "About Apify - Web Scraping and Automation Platform",
  "markdown": "# About Apify\n\nApify is a cloud platform for web scraping...",
  "author": "Apify Team",
  "excerpt": "Learn about Apify's mission to make the web more accessible...",
  "contentLength": 4521,
  "markdownLength": 3842,
  "estimatedTokens": 960,
  "processedAt": "2025-12-06T06:44:22.195Z",
  "success": true,
  "error": null
}
```

### Output Fields

- `url` - Original URL that was scraped
- `title` - Extracted page title
- `markdown` - Clean Markdown content ready for LLM input
- `author` - Article author (if detected)
- `excerpt` - Brief content summary
- `contentLength` - Character count of extracted content
- `markdownLength` - Character count of Markdown output
- `estimatedTokens` - Approximate token count (1 token ≈ 4 characters)
- `processedAt` - ISO timestamp of processing
- `success` - Boolean indicating if extraction succeeded
- `error` - Error message (if `success` is false)

## Features

### Intelligent Content Extraction

Uses Mozilla's Readability library to identify main article content while automatically removing:

- Navigation menus and headers
- Sidebars and advertisements
- Footers and copyright notices
- Cookie banners and popups
- Social media widgets
- Comment sections
- Related article suggestions
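The `estimatedTokens` field listed under Output Fields relies on the rule of thumb of 1 token ≈ 4 characters. The sketch below shows one way such an estimate could be computed. It assumes the Markdown character count is divided by four and rounded down, which matches the example output (3842 characters, 960 tokens) but is not confirmed as the Actor's actual implementation. The function name `estimate_tokens` is hypothetical.

```python
def estimate_tokens(markdown: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: character count divided by chars_per_token, rounded down.

    Assumption: the Actor's real formula is not documented; this only mirrors
    the "1 token ≈ 4 characters" heuristic.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return len(markdown) // chars_per_token


# A string with the same length as markdownLength in the example output.
sample = "x" * 3842
print(estimate_tokens(sample))  # 960
```

Real tokenizers differ by model, so treat the result as a budgeting aid and not as an exact count.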
### Token Optimization

- Markdown format: More efficient than HTML for LLM processing
- Optional image removal: Save tokens by excluding image references
- Optional link removal: Convert links to plain text when URLs aren't needed
- Token estimates: Know upfront how much of your context window you'll use

### Production-Ready

- Error handling: Gracefully handles failed requests and parsing errors
- Concurrent processing: Process multiple URLs simultaneously
- Detailed logging: Track processing status in real-time
- Fallback extraction: Uses body content if Readability fails

## Cost Efficiency

Running this Actor is cost-effective for AI development:

- Compute units: ~0.01 CU per page (approximately $0.003 USD)
- Speed: Average 2-3 seconds per URL
- Batch processing: Process 100 URLs for ~$0.30

Compare this to the cost of wasted API tokens from unprocessed HTML!

## How to Use

### Via Apify Console

1. Open the Actor in Apify Console
2. Click "Try for free"
3. Enter your URLs in the `startUrls` field
4. Configure optional parameters
5. Click "Start" and wait for results
6. Download Markdown from the dataset

### Via API

```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_API_TOKEN',
});

async function main() {
    // Start the Actor and wait for the run to finish
    const run = await client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call({
        startUrls: [
            { url: 'https://example.com/article' }
        ],
        removeImages: true
    });

    // listItems() returns an array with one result object per URL
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    for (const item of items) {
        if (item.success) console.log(item.markdown);
    }
}

main().catch(console.error);
```

### Integration with LangChain

```python
from apify_client import ApifyClient
# Recent LangChain releases ship the loader in langchain_community
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

client = ApifyClient('YOUR_API_TOKEN')

# Run the Actor
run = client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call(
    run_input={'startUrls': [{'url': 'https://example.com'}]}
)

# Load into LangChain; the mapping function must return a Document
loader = ApifyDatasetLoader(
    dataset_id=run['defaultDatasetId'],
    dataset_mapping_function=lambda item: Document(
        page_content=item['markdown'] or '',
        metadata={'source': item['url']},
    ),
)
docs = loader.load()
```

## Limitations

- JavaScript-heavy sites: Some dynamic content may not render. Consider using a browser-based scraper for SPAs.
- Paywalled content: Cannot access content behind authentication walls
- Rate limiting: Respect target website rate limits using `maxConcurrency`
- Token estimation: Approximate only; actual tokens vary by model tokenizer

## Tips for Best Results

- Start small: Test with 5-10 URLs before scaling up
- Enable token optimization: Use `removeImages` and `removeLinks` for RAG systems that don't need them
- Monitor output: Check the `estimatedTokens` field to stay within context limits
- Handle errors: Always check the `success` field before using Markdown output

## Support & Feedback

Need help or have suggestions?

- Issues: Report bugs via the Issues tab
- Questions: Contact us through Apify support
- Feature requests: We're actively developing this Actor and welcome feedback

## Version History

v1.0.0 (December 2025)

- Initial release
- Mozilla Readability integration
- Token estimation
- Configurable image and link removal
- Batch processing support

---

Built with ❤️ for the AI development community. Save tokens, save money, build better AI applications.
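The tips on checking `success` and monitoring `estimatedTokens` can be combined into a small post-processing step. The following is a minimal sketch that works on plain dictionaries shaped like the example output. The helper name `select_within_budget`, the 2000-token budget, and the sample items are all hypothetical and only serve as illustration.

```python
from typing import Iterable


def select_within_budget(items: Iterable[dict], max_tokens: int) -> list[dict]:
    """Keep successful items, in order, whose combined estimate fits the budget."""
    selected = []
    used = 0
    for item in items:
        if not item.get("success"):
            continue  # failed extraction; item["error"] holds the reason
        tokens = item.get("estimatedTokens") or 0
        if used + tokens > max_tokens:
            continue  # too large for the remaining budget; try later items
        selected.append(item)
        used += tokens
    return selected


# Hypothetical dataset items using the field names from the Output section.
items = [
    {"url": "https://example.com/a", "success": True, "estimatedTokens": 960, "error": None},
    {"url": "https://example.com/b", "success": False, "estimatedTokens": 0, "error": "Request failed"},
    {"url": "https://example.com/c", "success": True, "estimatedTokens": 5000, "error": None},
    {"url": "https://example.com/d", "success": True, "estimatedTokens": 1000, "error": None},
]

kept = select_within_budget(items, max_tokens=2000)
print([item["url"] for item in kept])  # ['https://example.com/a', 'https://example.com/d']
```

Because `estimatedTokens` is approximate, leaving some headroom below the model's actual context limit is a reasonable precaution.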
Ready to Get Started?
Try Universal Markdown Scraper for LLMs now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: botflowtech
- Pricing: Paid
- Total Runs: 10
- Active Users: 3