RAG Knowledge Loader

by botflowtech

22 runs
2 users

About RAG Knowledge Loader

What does this actor do?

RAG Knowledge Loader is an Apify actor that crawls documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts their content into vector-ready JSON for RAG applications. It runs in the cloud on the Apify platform.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

RAG Knowledge Loader scrapes documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts them into vector-ready JSON for RAG applications.

## Features

- Crawls entire documentation sites recursively
- Extracts clean, structured content
- Removes navigation, headers, and footers automatically
- Outputs vector-ready JSON
- Supports GitBook, ReadTheDocs, Notion, and custom doc sites

## Use Cases

- Build "Chat with Docs" chatbots
- Feed LLMs with up-to-date documentation
- Create knowledge bases for RAG pipelines
- Automate documentation updates for vector databases

## Input Parameters

### Required

- Start URLs (required): Array of documentation site URLs to scrape
  - Example: https://docs.apify.com/, https://your-gitbook-site.com

### Optional Configuration

- Max pages to crawl (default: 1000): Maximum number of pages to scrape
  - Minimum: 1
- Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
  - Example: ["**/api/**", "**/guides/**"]
- Exclude URL patterns (globs) (default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these patterns
- Content CSS Selectors (default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for the main content area
- Remove CSS Selectors (default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Selectors for elements to remove, such as navigation and headers
- Output Format (default: "vector-ready"):
  - "vector-ready": Flat structure optimized for embeddings
  - "hierarchical": Nested structure with full metadata
- Crawler Type (default: "cheerio"):
  - "cheerio": Fast HTTP crawler for static sites
  - "playwright": Browser-based crawler for JavaScript-heavy sites

### Example Input JSON

```json
{
  "startUrls": [
    { "url": "https://docs.example.com/" },
    { "url": "https://your-gitbook.com/docs" }
  ],
  "maxPages": 500,
  "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
  "includeUrlGlobs": ["**/docs/**"],
  "contentSelectors": "article, main, .markdown-body",
  "removeSelectors": "nav, footer, .sidebar",
  "outputFormat": "vector-ready",
  "crawlerType": "cheerio"
}
```

### Minimal Input Example

```json
{
  "startUrls": [
    { "url": "https://docs.example.com/" }
  ]
}
```

## Output Format

### Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

```json
{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"],
    "readyForEmbedding": true
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "url": "https://docs.example.com/page",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "wordCount": 1234
      }
    }
  ]
}
```

### Hierarchical Format

Includes full document structure with headings and metadata:

```json
{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"]
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "url": "https://docs.example.com/page",
      "title": "Page Title",
      "content": "Full page content...",
      "metadata": {
        "description": "Page meta description",
        "keywords": "api, documentation",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "headings": [
          { "level": 1, "text": "Introduction" },
          { "level": 2, "text": "Getting Started" }
        ],
        "wordCount": 1234,
        "characterCount": 5678
      }
    }
  ]
}
```

## Integration with Vector Databases

The output is ready to use with popular RAG frameworks:

- LangChain: Use JSONLoader to load documents
- LlamaIndex: Import as Document objects
- Pinecone/Weaviate: Batch upsert with metadata
- Chroma: Add to collection with embeddings
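As a rough illustration of the batch-upsert step mentioned above, the following sketch reads a vector-ready result and groups its documents into fixed-size batches. It uses only the Python standard library. The sample data, the helper names `iter_records` and `batched`, and the batch size are assumptions made for this example and are not part of the actor; the field names (`documents`, `id`, `text`, `metadata`) follow the vector-ready format shown in the documentation.

```python
import json
from typing import Iterable, Iterator

# Illustrative sample shaped like the documented vector-ready output.
# The values are made up for this sketch.
SAMPLE_OUTPUT = """
{
  "metadata": {"totalPages": 4, "readyForEmbedding": true},
  "documents": [
    {"id": "doc-1", "text": "First page text.",
     "metadata": {"source": "https://docs.example.com/a", "title": "A"}},
    {"id": "doc-2", "text": "   ",
     "metadata": {"source": "https://docs.example.com/b", "title": "B"}},
    {"id": "doc-3", "text": "Third page text.",
     "metadata": {"source": "https://docs.example.com/c", "title": "C"}},
    {"id": "doc-4", "text": "Fourth page text.",
     "metadata": {"source": "https://docs.example.com/d", "title": "D"}}
  ]
}
"""


def iter_records(output: dict) -> Iterator[dict]:
    """Yield one {id, text, metadata} record per document with non-empty text."""
    for doc in output.get("documents", []):
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # skip pages with nothing to embed
        yield {"id": doc["id"], "text": text, "metadata": doc.get("metadata", {})}


def batched(records: Iterable[dict], batch_size: int) -> Iterator[list]:
    """Group records into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


output = json.loads(SAMPLE_OUTPUT)
batches = list(batched(iter_records(output), batch_size=2))

for batch in batches:
    # Placeholder: swap this print for your vector store's upsert/add call.
    print([record["id"] for record in batch])
```

In a real pipeline the `print` call would be replaced by an embedding step followed by the upsert or add method of whichever vector store you use; keeping the `metadata` dictionary alongside each record preserves the source URL and title for citation in answers.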

Ready to Get Started?

Try RAG Knowledge Loader now on Apify. Free tier available with no credit card required.

Actor Information

Developer: botflowtech
Pricing: Paid
Total Runs: 22
Active Users: 2

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.