Universal Markdown Scraper for LLMs

Author: botflowtech

10 runs

3 users

About Universal Markdown Scraper for LLMs

What does this actor do?

Universal Markdown Scraper for LLMs is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

# Universal Markdown Scraper for LLMs

Transform any webpage into clean, token-efficient Markdown optimized for ChatGPT, Claude, and other Large Language Models. This Actor automatically removes ads, navigation bars, footers, and other noise that wastes valuable API tokens.

## Why Use This Actor?

Most web scrapers return messy JSON or raw HTML that's unsuitable for LLM context windows. This Actor solves that problem by:

- Extracting only main content using Mozilla's Readability algorithm
- Converting to clean Markdown format that LLMs process efficiently
- Removing token-wasting elements like ads, sidebars, cookie banners, and navigation
- Providing token estimates so you know roughly how much context you're using
- Processing at scale with concurrent URL handling

Perfect for AI developers building RAG systems, chatbots, research assistants, and content analysis tools.

## Use Cases

- RAG (Retrieval Augmented Generation): Extract clean content for vector databases and knowledge bases
- AI Research Assistants: Feed web articles directly into ChatGPT/Claude for analysis
- Content Summarization: Get article text without noise for LLM-powered summarizers
- Documentation Processing: Convert technical docs to Markdown for AI-powered Q&A systems
- News Monitoring: Extract clean article content for sentiment analysis and topic modeling
- Training Data Preparation: Collect high-quality text data for fine-tuning LLMs

## Input

The Actor accepts the following input parameters:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `startUrls` | Array | Yes | List of URLs to scrape. Format: `[{ "url": "https://example.com" }]` |
| `maxConcurrency` | Integer | No | Number of pages to process simultaneously (1-50, default: 10) |
| `removeImages` | Boolean | No | Strip all images to save tokens (default: false) |
| `removeLinks` | Boolean | No | Convert hyperlinks to plain text to save tokens (default: false) |

### Example Input

```json
{
  "startUrls": [
    { "url": "https://apify.com/about" },
    { "url": "https://openai.com/research" },
    { "url": "https://www.anthropic.com/news" }
  ],
  "maxConcurrency": 5,
  "removeImages": false,
  "removeLinks": false
}
```

## Output

The Actor outputs clean Markdown with metadata for each URL processed. Results are stored in the default dataset.

### Example Output

```json
{
  "url": "https://apify.com/about",
  "title": "About Apify - Web Scraping and Automation Platform",
  "markdown": "# About Apify\n\nApify is a cloud platform for web scraping...",
  "author": "Apify Team",
  "excerpt": "Learn about Apify's mission to make the web more accessible...",
  "contentLength": 4521,
  "markdownLength": 3842,
  "estimatedTokens": 960,
  "processedAt": "2025-12-06T06:44:22.195Z",
  "success": true,
  "error": null
}
```

### Output Fields

- `url` - Original URL that was scraped
- `title` - Extracted page title
- `markdown` - Clean Markdown content ready for LLM input
- `author` - Article author (if detected)
- `excerpt` - Brief content summary
- `contentLength` - Character count of extracted content
- `markdownLength` - Character count of Markdown output
- `estimatedTokens` - Approximate token count (1 token ≈ 4 characters)
- `processedAt` - ISO 8601 timestamp of processing
- `success` - Boolean indicating if extraction succeeded
- `error` - Error message (if `success` is false)

## Features

### Intelligent Content Extraction

Uses Mozilla's Readability library to identify main article content while automatically removing:

- Navigation menus and headers
- Sidebars and advertisements
- Footers and copyright notices
- Cookie banners and popups
- Social media widgets
- Comment sections
- Related article suggestions

### Token Optimization

- Markdown format: More efficient than HTML for LLM processing
- Optional image removal: Save tokens by excluding image references
- Optional link removal: Convert links to plain text when URLs aren't needed
- Token estimates: Know upfront how much of your context window you'll use

### Production-Ready

- Error handling: Gracefully handles failed requests and parsing errors
- Concurrent processing: Process multiple URLs simultaneously
- Detailed logging: Track processing status in real-time
- Fallback extraction: Uses body content if Readability fails

## Cost Efficiency

Running this Actor is cost-effective for AI development:

- Compute units: ~0.01 CU per page (approximately $0.003 USD)
- Speed: Average 2-3 seconds per URL
- Batch processing: Process 100 URLs for ~$0.30

Compare this to the cost of wasted API tokens from unprocessed HTML!

## How to Use

### Via Apify Console

1. Open the Actor in Apify Console
2. Click "Try for free"
3. Enter your URLs in the `startUrls` field
4. Configure optional parameters
5. Click "Start" and wait for results
6. Download Markdown from the dataset

### Via API

```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_API_TOKEN',
});

async function main() {
    // Start the Actor and wait for the run to finish
    const run = await client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call({
        startUrls: [{ url: 'https://example.com/article' }],
        removeImages: true,
    });

    // listItems() resolves to an object whose `items` property is an array of results
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    for (const item of items) {
        if (item.success) console.log(item.markdown);
    }
}

main().catch(console.error);
```

### Integration with LangChain

```python
from apify_client import ApifyClient
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

client = ApifyClient('YOUR_API_TOKEN')

# Run the Actor
run = client.actor('YOUR_USERNAME/universal-markdown-scraper-llm').call(
    run_input={'startUrls': [{'url': 'https://example.com'}]}
)

# Load into LangChain; the mapping function must return a Document
loader = ApifyDatasetLoader(
    dataset_id=run['defaultDatasetId'],
    dataset_mapping_function=lambda item: Document(
        page_content=item['markdown'],
        metadata={'source': item['url']},
    ),
)
docs = loader.load()
```

## Limitations

- JavaScript-heavy sites: Some dynamic content may not render. Consider using a browser-based scraper for SPAs.
- Paywalled content: Cannot access content behind paywalls or login walls
- Rate limiting: Respect target website rate limits using `maxConcurrency`
- Token estimation: Approximate only; actual tokens vary by model tokenizer

## Tips for Best Results

- Start small: Test with 5-10 URLs before scaling up
- Enable token optimization: Use `removeImages` and `removeLinks` for RAG systems that don't need images or links
- Monitor output: Check the `estimatedTokens` field to stay within context limits
- Handle errors: Always check the `success` field before using Markdown output

## Support & Feedback

Need help or have suggestions?

- Issues: Report bugs via the Issues tab
- Questions: Contact us through Apify support
- Feature requests: We're actively developing this Actor and welcome feedback

## Version History

v1.0.0 (December 2025)

- Initial release
- Mozilla Readability integration
- Token estimation
- Configurable image and link removal
- Batch processing support

---

Built with ❤️ for the AI development community. Save tokens, save money, build better AI applications.
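
### Example: Filtering Results by Token Budget

The tips above about checking `success` and `estimatedTokens` can be sketched in a few lines of Python. This is a minimal illustration that works on dataset items shaped like the example output, not part of the Actor itself. The helper names `estimate_tokens` and `select_within_budget` are hypothetical. The fallback estimate assumes the documented heuristic of about 4 characters per token, rounded down; that rounding matches the example output (3842 characters, 960 tokens), but how the Actor actually rounds is an assumption.

```python
from typing import Any, Dict, List, Tuple


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: about 4 characters per token, rounded down (assumed)."""
    return len(text) // chars_per_token


def select_within_budget(
    items: List[Dict[str, Any]], max_tokens: int
) -> Tuple[List[Dict[str, Any]], int]:
    """Keep successful items, in order, that still fit in the token budget.

    Returns the selected items and the total estimated tokens they use.
    """
    selected: List[Dict[str, Any]] = []
    used = 0
    for item in items:
        # Skip failed extractions; their `error` field explains what went wrong.
        if not item.get("success"):
            continue
        # Prefer the Actor's own estimate; fall back to the character heuristic.
        tokens = item.get("estimatedTokens")
        if tokens is None:
            tokens = estimate_tokens(item.get("markdown") or "")
        # Skip items that would overflow the budget; later, smaller ones may fit.
        if used + tokens > max_tokens:
            continue
        selected.append(item)
        used += tokens
    return selected, used


# Hypothetical sample items shaped like the Actor's example output.
sample_items = [
    {"url": "a", "success": True, "markdown": "x" * 400, "estimatedTokens": 100},
    {"url": "b", "success": False, "markdown": None, "error": "Request failed"},
    {"url": "c", "success": True, "markdown": "y" * 800},
    {"url": "d", "success": True, "markdown": "z" * 4000, "estimatedTokens": 1000},
]

kept, total = select_within_budget(sample_items, max_tokens=500)
print([item["url"] for item in kept], total)
```

Because the estimate is only approximate, it is safer to set `max_tokens` somewhat below the model's real context limit, or to recount with the model's own tokenizer before sending the final prompt.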

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Universal Markdown Scraper for LLMs now on Apify. Free tier available with no credit card required.

Actor Information

Developer: botflowtech
Pricing: Paid
Total Runs: 10
Active Users: 3

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
