RAG Markdown Cleaner
by cooldev
About RAG Markdown Cleaner
Transform web pages into RAG-ready Markdown with smart chunking, metadata, code detection & quality scoring. Production-tested deduplication. Fully open-source (Apache 2.0): review code, contribute, or self-host. Turn messy HTML into embedding-ready knowledge instantly.
What does this actor do?
RAG Markdown Cleaner is a web scraping tool on the Apify platform that fetches web pages, extracts their main content, and converts it into clean, chunked Markdown ready for embedding in RAG pipelines. It runs entirely in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results (or drive these last steps from the API, as sketched below)
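If you prefer to run the actor programmatically, the official apify-client package can start it and fetch its dataset. This is a minimal sketch; the actor ID `cooldev/rag-markdown-cleaner` is a guess based on this listing and may differ:

```python
from apify_client import ApifyClient

client = ApifyClient(token="YOUR_APIFY_TOKEN")

# Start the actor and wait for it to finish (actor ID is assumed; check the listing)
run = client.actor("cooldev/rag-markdown-cleaner").call(
    run_input={"start_urls": [{"url": "https://example.com/article"}]}
)

# Download the results from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["total_chunks"])
```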
Documentation
# Web to Markdown Converter for RAG

Convert web pages into clean, token-efficient Markdown optimized for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, with advanced features for production use.

## Features

### Core Extraction

- Clean Content Extraction: Uses Mozilla's Readability algorithm to extract only the main article content, removing navigation, footers, ads, and other clutter
- Markdown Conversion: Converts HTML to clean, readable Markdown format
- Smart Link Resolution: Automatically converts relative URLs to absolute URLs for functional references

### RAG Optimization

- Smart Semantic Chunking: Automatically splits content into embedding-friendly chunks (configurable size, default 1000 chars); a sketch of the idea follows the Use Cases list
  - Preserves heading hierarchy as context breadcrumbs
  - Configurable overlap between chunks for better continuity
  - Token estimates for each chunk
- Metadata Extraction: Captures author, publish date, keywords, description, language, and content type
- Code Block Detection: Identifies and extracts code blocks with language detection
  - Separate tracking of fenced code blocks and inline code
  - Line counts and language identification

### Quality & Deduplication

- Content Quality Metrics:
  - Text density, word count, reading time
  - Paragraph and sentence analysis
  - Structure scoring (presence of lists, headings, links)
- Deduplication Hashing: SHA-256 content hashing and similarity hashing for duplicate detection
- Production Ready: Docker-based, robust error handling, deployable to the Apify platform

## Use Cases

- Building RAG knowledge bases from web content with optimal chunk sizes
- Preparing training data for LLMs with quality metrics
- Creating clean documentation from web pages with preserved structure
- Batch processing articles with deduplication
- Web content archival in Markdown format with full metadata
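To make the chunking behavior concrete, here is a minimal, hypothetical sketch of heading-aware chunking with overlap, matching the chunk fields shown in the Output section below. The function name, defaults, and splitting logic are assumptions drawn from the feature list and output schema, not the actor's actual source:

```python
import re

def chunk_markdown(markdown: str, max_chars: int = 1000, overlap: int = 100):
    """Split Markdown into chunks, carrying a heading breadcrumb as context."""
    chunks, breadcrumbs, buffer = [], [], ""

    def flush():
        chunks.append({
            "chunk_id": len(chunks) + 1,
            "content": buffer,
            "heading_context": " > ".join(breadcrumbs),
            "char_count": len(buffer),
            "estimated_tokens": len(buffer) // 4,  # rough rule: 1 token is about 4 chars
        })

    for line in markdown.splitlines(keepends=True):
        heading = re.match(r"^(#{1,6})\s+(.*)", line)
        if heading:
            level = len(heading.group(1))
            # Truncate the breadcrumb to the parent level, then append this heading
            breadcrumbs = breadcrumbs[:level - 1] + [heading.group(2).strip()]
        buffer += line
        if len(buffer) >= max_chars:
            flush()
            buffer = buffer[-overlap:]  # carry a tail of the chunk over as overlap
    if buffer.strip():
        flush()
    return chunks
```

Carrying the heading breadcrumb with each chunk lets a retriever show where in the page a chunk came from, which is what the `heading_context` field below provides.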
## Input

The actor accepts the following input parameters:

### start_urls (required)

Array of URL objects to scrape and convert. Example:

```json
[
  { "url": "https://example.com/article1" },
  { "url": "https://example.com/article2" }
]
```

### include_links (optional, default: true)

Whether to include hyperlinks in the converted Markdown output.

## Output

The actor pushes results to the default Apify Dataset. Each scraped URL produces a comprehensive object:

```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "markdown_content": "# Article Title\n\nClean markdown content...",
  "chunks": [
    {
      "chunk_id": 1,
      "content": "First semantic chunk...",
      "heading_context": "Introduction > Background",
      "char_count": 850,
      "estimated_tokens": 212
    }
  ],
  "metadata": {
    "url": "https://example.com/article",
    "domain": "example.com",
    "scraped_at": "2025-12-11T01:23:45.123Z",
    "author": "John Doe",
    "publish_date": "2024-01-15",
    "last_modified": "2024-03-20",
    "language": "en",
    "keywords": ["web scraping", "python", "rag"],
    "description": "Article description from meta tags",
    "content_type": "blog"
  },
  "code_blocks": {
    "fenced_blocks": [
      {
        "language": "python",
        "code": "def example():\n    pass",
        "lines": 2
      }
    ],
    "inline_code_count": 15,
    "has_code": true
  },
  "quality_metrics": {
    "text_density": 0.75,
    "paragraph_count": 25,
    "word_count": 1500,
    "sentence_count": 85,
    "avg_sentence_length": 18.5,
    "reading_time_minutes": 8,
    "has_lists": true,
    "has_headings": true,
    "has_links": true,
    "structure_score": 1.0
  },
  "hashes": {
    "content_hash": "a1b2c3d4e5f6...",
    "similarity_hash": "1234567890abcdef"
  },
  "total_chunks": 3,
  "total_chars": 2500,
  "estimated_tokens": 625
}
```

### Output Fields

Basic Info:

- url: The original URL that was scraped
- title: The extracted page title
- markdown_content: The main content converted to Markdown format (with absolute URLs)

Chunks (for RAG):

- chunks: Array of semantic chunks optimized for embeddings
  - chunk_id: Sequential identifier
  - content: The chunk text
  - heading_context: Breadcrumb trail of headings (e.g., "Chapter 1 > Section 2")
  - char_count: Character count
  - estimated_tokens: Approximate token count (1 token ≈ 4 chars)

Metadata:

- domain: Extracted domain name
- scraped_at: ISO timestamp of scraping
- author: Page author (from meta tags or JSON-LD)
- publish_date: Publication date
- last_modified: Last modification date
- language: Two-letter language code
- keywords: Array of keywords from meta tags
- description: Page description
- content_type: Detected type (blog, documentation, wiki, product, general)

Code Blocks:

- fenced_blocks: Array of code blocks with language and line count
- inline_code_count: Number of inline code snippets
- has_code: Boolean indicating presence of code

Quality Metrics:

- text_density: Ratio of text to HTML (higher = cleaner content)
- word_count: Total words
- reading_time_minutes: Estimated reading time
- structure_score: 0-1 score based on presence of lists, headings, and links

Deduplication (a sketch of these hashes follows this list):

- content_hash: SHA-256 hash for exact duplicate detection
- similarity_hash: Normalized hash for near-duplicate detection
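The two hash fields are easy to picture. Below is a hypothetical sketch of the idea: an exact SHA-256 digest for exact duplicates, plus a digest of case- and whitespace-normalized text for near duplicates. The actor's actual normalization is not documented here and may differ:

```python
import hashlib

def content_hash(markdown: str) -> str:
    """Exact-duplicate detection: SHA-256 over the raw Markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def similarity_hash(markdown: str) -> str:
    """Near-duplicate detection: hash a normalized form so trivial
    differences (case, extra whitespace) do not change the digest."""
    normalized = " ".join(markdown.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```

Two pages that differ only in capitalization or whitespace then share a similarity_hash while still having distinct content_hash values.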
## Tips

- Chunking: The default chunk size is 1000 characters with a 100-character overlap, which suits most embedding models
- Rate Limiting: For large batches, consider adding delays between requests
- Custom Headers: The actor includes a User-Agent header to avoid blocks
- Timeout: The default timeout is 30 seconds per URL
- Error Handling: Failed URLs are logged but don't stop the actor

## Example Use in RAG Pipeline

```python
# After running the actor, use the output in your RAG system
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_TOKEN')

run = client.actor('YOUR_ACTOR_ID').call(run_input={
    'start_urls': [
        {'url': 'https://docs.example.com/page1'},
        {'url': 'https://docs.example.com/page2'}
    ],
    'include_links': False
})

seen_hashes = set()  # similarity hashes seen so far, for dedup across items

# Get dataset items with all enhancements
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    # Filter by quality
    if item['quality_metrics']['structure_score'] < 0.5:
        continue  # Skip low-quality content

    # Check for duplicates using the similarity hash
    if item['hashes']['similarity_hash'] in seen_hashes:
        continue  # Skip duplicate
    seen_hashes.add(item['hashes']['similarity_hash'])

    # Process each chunk for embeddings
    for chunk in item['chunks']:
        # Use heading context for better retrieval
        context = f"{item['metadata']['domain']} - {chunk['heading_context']}"

        # Embed with your preferred model (embed_model is your own object)
        embedding = embed_model.encode(chunk['content'])

        # Store in your vector database with metadata (vector_db is your own client)
        vector_db.upsert(
            id=f"{item['url']}#{chunk['chunk_id']}",
            vector=embedding,
            metadata={
                'url': item['url'],
                'title': item['title'],
                'context': context,
                'content_type': item['metadata']['content_type'],
                'has_code': item['code_blocks']['has_code'],
                'chunk_tokens': chunk['estimated_tokens']
            }
        )
```

## Run Locally

```bash
apify run
```

### Testing Locally

Create a `.actor/INPUT.json` file:

```json
{
  "start_urls": [
    { "url": "https://example.com" }
  ],
  "include_links": true
}
```

Then run:

```bash
python src/main.py
```

## Deployment

### Deploy to Apify

1. Install the Apify CLI:

   ```bash
   npm install -g apify-cli
   ```

2. Log in to Apify:

   ```bash
   apify login
   ```

3. Push the actor to Apify:

   ```bash
   apify push
   ```

### Docker Build

To build the Docker image locally:

```bash
docker build -t web-to-markdown .
```
## Dependencies

- apify: Apify SDK for Python
- requests: HTTP library for fetching web pages
- readability-lxml: Extract main content from HTML
- markdownify: Convert HTML to Markdown
- lxml: XML/HTML processing library

## Configuration

The actor is configured via:

- .actor/input_schema.json: Defines the input UI in Apify Console
- Dockerfile: Container configuration
- requirements.txt: Python dependencies

## Troubleshooting

No output produced:

- Check that URLs are accessible and return HTML content
- Verify that the input format matches the schema

Markdown quality issues:

- Some pages may have poor HTML structure
- Readability works best with article-style content

Timeout errors:

- Increase the timeout in requests.get() if needed
- Check network connectivity

## License

This project is open source under the Apache 2.0 license and available for use in RAG and LLM applications.

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

## Support

For issues or questions, please open a GitHub issue or contact the developer via the Apify platform.
Actor Information
- Developer: cooldev
- Pricing: Paid
- Total Runs: 10
- Active Users: 2