RAG Pipeline Data Collector

by scraper_guru

AI-ready web content extraction for RAG systems, LLMs, and AI agents. Single-page or multi-page scraping with parallel processing.

11 runs
1 user

About RAG Pipeline Data Collector

What does this actor do?

RAG Pipeline Data Collector is a web scraping and automation Actor available on the Apify platform. It extracts clean, structured web content (Markdown or HTML) for RAG systems, LLM pipelines, and AI agents, and runs entirely in the cloud with no local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation
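Scheduled runs typically pair with webhooks: when a run finishes, Apify POSTs a JSON notification to your endpoint, which can then fetch the results. A minimal sketch of parsing such a notification — the field names (`eventType`, `resource.defaultDatasetId`) are assumptions based on Apify's default webhook payload template, so verify them against the payload your endpoint actually receives:

```python
import json

def extract_run_info(payload_json: str) -> tuple[str, str]:
    """Pull the event type and default dataset id from a webhook notification.

    Field names are assumptions based on Apify's default webhook payload
    template; check them against your real payload before relying on this.
    """
    payload = json.loads(payload_json)
    event_type = payload.get("eventType", "")
    dataset_id = payload.get("resource", {}).get("defaultDatasetId", "")
    return event_type, dataset_id

# Hypothetical notification for a finished run
sample = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"defaultDatasetId": "abc123"},
})
print(extract_run_info(sample))  # ('ACTOR.RUN.SUCCEEDED', 'abc123')
```

With the dataset id in hand, your handler can download the scraped items via the Apify API or client library.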

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
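The same steps can be driven programmatically. A minimal sketch that assembles a single-page run input matching the fields documented below — the dict is built by a plain helper so nothing touches the network; the actual API call (which needs a token and the `apify-client` package) is shown commented out:

```python
def build_run_input(url: str, output_format: str = "markdown",
                    remove_noise: bool = True) -> dict:
    """Assemble a single-page run input using the fields documented here."""
    return {
        "scrape_mode": "single",
        "url": url,
        "output_format": output_format,
        "remove_noise": remove_noise,
    }

run_input = build_run_input("https://example.com/article")

# With a real token you would then run (requires `pip install apify-client`):
# from apify_client import ApifyClient
# client = ApifyClient("your-token")
# run = client.actor("YOUR_ACTOR_ID").call(run_input=run_input)
# items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(run_input["scrape_mode"])  # single
```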

Documentation

RAG Pipeline Data Collector - AI-Ready Web Content Extraction

Extract clean, structured web content optimized for RAG systems, LLMs, and AI agents. Built with Crawl4AI for lightning-fast parallel processing and intelligent content filtering.

## 🎯 What is RAG Pipeline Data Collector?

The RAG Pipeline Data Collector is a specialized web scraping Actor designed for AI and machine learning workflows. It transforms raw web pages into clean, structured Markdown or HTML content that is ready to feed into RAG (Retrieval-Augmented Generation) systems, vector databases, LLM training pipelines, and AI agents.

Unlike traditional web scrapers, this Actor focuses on extracting meaningful content while removing navigation menus, ads, footers, and other noise that would pollute your AI training data or RAG knowledge base.

Perfect for:

- 🤖 Building RAG systems and AI chatbots
- 📚 Creating knowledge bases for LLMs
- 🔍 Training data collection for machine learning
- 💬 Content ingestion for vector databases (Pinecone, Weaviate, Chroma)
- 🔗 n8n, Zapier, and Make.com automation workflows
- 📊 Large-scale content analysis and research

## 🚀 Key Features

### Dual Operating Modes

Single Page Mode - fast, API-style extraction:

- Extract individual pages in 15-30 seconds
- Perfect for real-time integrations
- Ideal for n8n/Zapier/Make workflows
- On-demand content processing

Multi-Page Mode - bulk extraction with parallel processing:

- Process 50+ pages simultaneously
- 5-10x faster than sequential scraping
- Three intelligent crawl strategies
- Complete knowledge base extraction

### Three Crawl Strategies

1. Sitemap Strategy 📋 - automatically parses sitemap.xml for the fastest parallel processing and complete site coverage. Best for: documentation sites, blogs, news sites.
2. Deep Crawl Strategy 🕸️ - follows internal links recursively with configurable depth (1-5 levels) to discover hidden content. Best for: sites without sitemaps, complex navigation.
3. Archive Discovery 📰 - intelligent pattern detection (/blog, /posts, /archive) for targeted, blog-focused extraction. Best for: content-heavy sites, news archives.

### Clean, AI-Ready Output

- ✅ Markdown Output - perfectly formatted for LLMs
- ✅ Noise Removal - intelligent filtering of ads, navigation, footers
- ✅ Metadata Extraction - title, description, author, language
- ✅ Image URLs - all images with full URLs
- ✅ Link Extraction - internal and external links separated
- ✅ Statistics - word count, character count, image count

## 💡 Use Cases

### RAG Systems & Vector Databases

Feed clean, structured content directly into your RAG pipeline:

```python
# LangChain integration
from apify_client import ApifyClient
from langchain.document_loaders import ApifyDatasetLoader

client = ApifyClient("your-token")

run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "multi",
    "start_url": "https://docs.example.com",
    "crawl_strategy": "sitemap",
    "max_pages": 100,
    "output_format": "markdown",
    "remove_noise": True
})

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: item["content"]
)
docs = loader.load()
# Now feed docs to your vector database
```

### n8n Automation Workflows

1. Add the Apify node to your workflow
2. Select RAG Pipeline Data Collector
3. Configure single or multi-page mode
4. Connect to Pinecone, Weaviate, or Supabase nodes
5. Automate your RAG data pipeline

### Content Analysis & Research

Extract and analyze large volumes of content:

- Competitor research and monitoring
- Market intelligence gathering
- Academic research data collection
- Content aggregation for newsletters

### AI Training Data Collection

Build high-quality training datasets:

- Clean, structured text for fine-tuning
- Consistent format across sources
- Metadata for context preservation
- Scalable bulk extraction

## 📥 Input Configuration

### Single Page Mode

```json
{
  "scrape_mode": "single",
  "url": "https://example.com/article",
  "output_format": "markdown",
  "remove_noise": true,
  "include_images": true,
  "include_links": true,
  "include_metadata": true
}
```

### Multi-Page Mode (Sitemap)

```json
{
  "scrape_mode": "multi",
  "start_url": "https://docs.example.com",
  "crawl_strategy": "sitemap",
  "max_pages": 100,
  "output_format": "markdown",
  "remove_noise": true
}
```

### Multi-Page Mode (Deep Crawl)

```json
{
  "scrape_mode": "multi",
  "start_url": "https://blog.example.com",
  "crawl_strategy": "deep",
  "max_depth": 2,
  "max_pages": 50,
  "output_format": "markdown"
}
```

### Multi-Page Mode (Archive Discovery)

```json
{
  "scrape_mode": "multi",
  "start_url": "https://news.example.com/archive",
  "crawl_strategy": "archive",
  "max_pages": 200,
  "output_format": "markdown"
}
```

## 📤 Output Format

Each scraped page returns a structured JSON object:

```json
{
  "url": "https://example.com/article",
  "content": "# Article Title\n\nClean markdown content...",
  "format": "markdown",
  "statistics": {
    "word_count": 1500,
    "character_count": 8500,
    "image_count": 5,
    "internal_links": 12,
    "external_links": 3
  },
  "images": [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg"
  ],
  "links": {
    "internal": ["https://example.com/page1", "https://example.com/page2"],
    "external": ["https://external.com"]
  },
  "metadata": {
    "title": "Article Title",
    "description": "Article description",
    "author": "Author Name",
    "language": "en"
  },
  "scrape_mode": "single",
  "scraped_at": "2024-12-11T10:30:00Z"
}
```

## 🔧 How It Works

The Actor uses Crawl4AI, a web scraping framework optimized for AI applications:

1. Intelligent Rendering - handles JavaScript-heavy sites with Playwright
2. Parallel Processing - scrapes multiple pages simultaneously (5-20x faster)
3. Noise Filtering - removes ads, navigation, and footers using the fit_markdown algorithm
4. LLM-Optimized Output - clean Markdown ready for AI consumption
5. Smart Crawling - three strategies to handle any site structure

### Performance Expectations

- Single Page Mode: 15-30 seconds per page
- Multi-Page (Sitemap): 1-2 minutes for 50 pages
- Multi-Page (Deep Crawl): 2-5 minutes for 50 pages (varies by depth)
- Multi-Page (Archive): 1-3 minutes for 50 pages

## 💰 Pricing & Compute Units

This Actor is optimized for cost-effective operation:

- Single Page Mode: ~0.05-0.1 CU per page
- Multi-Page Mode: ~2-5 CU per 50 pages (parallel processing advantage)

Recommended memory: 4096 MB for optimal performance.

## 📊 Example Runs

Coming soon! Check back for public run examples.

## 🛠️ Advanced Features

### Output Format Options

- Markdown: clean, LLM-friendly format (recommended for RAG)
- HTML: cleaned HTML with noise removed
- Raw HTML: original HTML without processing

### Content Filtering

- Noise Removal: automatically removes navigation, ads, footers
- Image Filtering: include/exclude images
- Link Filtering: include/exclude links
- Metadata Control: include/exclude page metadata

### Crawl Configuration

- Max Pages: control total pages (1-500)
- Max Depth: control crawl depth (1-5 levels)
- Same Domain Only: restrict to the starting domain
- Pattern Matching: custom URL filtering (coming soon)

## 🔗 Integration Examples

### Make.com (Integromat)

1. Add the Apify module
2. Select Run Actor
3. Choose RAG Pipeline Data Collector
4. Configure input parameters
5. Map the output to your RAG pipeline modules

### Zapier

1. Add the Apify action
2. Select Run Actor
3. Choose RAG Pipeline Data Collector
4. Configure the trigger and input
5. Connect to a vector database action

### Python SDK

```python
from apify_client import ApifyClient

client = ApifyClient("your-token")

# Single page extraction
run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "single",
    "url": "https://example.com/article",
    "output_format": "markdown"
})

# Get results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["content"])
```

### JavaScript/Node.js

```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'your-token' });

const run = await client.actor('YOUR_ACTOR_ID').call({
    scrape_mode: 'multi',
    start_url: 'https://docs.example.com',
    crawl_strategy: 'sitemap',
    max_pages: 50
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(item.url, item.statistics.word_count);
});
```

## ⚙️ Configuration Tips

### For Best RAG Results

- ✅ Enable remove_noise for cleaner content
- ✅ Use the markdown output format
- ✅ Include metadata for context
- ✅ Set max_pages appropriately for your needs

### For Faster Scraping

- ⚡ Use the sitemap strategy when available
- ⚡ Limit max_depth to 1-2 for deep crawls
- ⚡ Process in batches of 50-100 pages
- ⚡ Allocate 4096 MB of memory

### For Cost Optimization

- 💰 Use single mode for small jobs
- 💰 Batch requests in multi-page mode
- 💰 Set reasonable max_pages limits
- 💰 Monitor compute unit usage

## 🐛 Troubleshooting

### Sitemap Not Found

If the sitemap strategy fails, the Actor automatically falls back to deep crawl.

### JavaScript-Heavy Sites

Some sites may require additional wait time. The Actor handles this automatically with Playwright.

### Rate Limiting

The Actor respects robots.txt and includes configurable delays between requests.
### Missing Content

If content is missing, try:

- Disabling noise removal temporarily
- Using the raw_html format to inspect the page
- Increasing the timeout settings

## 📚 Documentation & Support

- GitHub Issues: report bugs or request features
- Apify Discord: join the community for support
- Documentation: full API documentation

## 🏷️ Tags

web-scraping, rag, llm, ai, machine-learning, vector-database, langchain, chatbot, knowledge-base, content-extraction, markdown, automation, n8n, zapier, make

## 📄 License

This Actor is provided as-is for use on the Apify platform. Web scraping should be done responsibly and in accordance with website terms of service.

## 🤝 Ethical Scraping

This Actor:

- ✅ Respects robots.txt
- ✅ Only extracts publicly available content
- ✅ Does not extract personal data
- ✅ Includes configurable rate limiting
- ✅ Identifies itself properly in requests

Always ensure you have the right to scrape content from target websites and respect their terms of service.

---

Built with ❤️ using Crawl4AI

Need custom features or enterprise support? Contact us through the Apify platform!
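Once dataset items are downloaded, the `content` field usually needs to be split into chunks before embedding into a vector database. A minimal, dependency-free sketch of fixed-size word chunking with overlap — the record shape follows the output format documented above, but the chunker itself is an illustration, not part of the Actor:

```python
def chunk_markdown(content: str, chunk_words: int = 200, overlap: int = 20) -> list[str]:
    """Split scraped markdown into overlapping word-window chunks for embedding."""
    words = content.split()
    if not words:
        return []
    step = max(chunk_words - overlap, 1)  # advance by window size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # last window already reached the end of the text
    return chunks

# Example: a dataset item shaped like the output format documented above
item = {
    "url": "https://example.com/article",
    "content": " ".join(f"word{i}" for i in range(450)),
}
chunks = chunk_markdown(item["content"], chunk_words=200, overlap=20)
print(len(chunks))  # 3 overlapping windows cover the 450-word article
```

Real pipelines often chunk on markdown headings or sentence boundaries instead of raw word counts; the clean Markdown output makes heading-based splitting straightforward.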

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try RAG Pipeline Data Collector now on Apify. Free tier available with no credit card required.

Actor Information

Developer
scraper_guru
Pricing
Paid
Total Runs
11
Active Users
1
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
