RAG Knowledge Loader
by botflowtech
About RAG Knowledge Loader
What does this actor do?
RAG Knowledge Loader is an actor on the Apify platform that crawls documentation sites and converts their pages into vector-ready JSON for RAG applications. It runs in the cloud, so no local setup is needed.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
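The same steps can also be scripted instead of clicked through. The following is a minimal sketch, not the developer's official client code. It assumes the `apify-client` Python package is installed, that the actor writes its results to the run's default dataset, and that you supply your own API token and the actor's ID (neither appears on this page). The helper names `build_run_input` and `run_actor` are hypothetical; the input field names are taken from the example input in the Documentation section.

```python
from typing import Any


def build_run_input(
    start_urls: list[str],
    max_pages: int = 1000,
    output_format: str = "vector-ready",
    crawler_type: str = "cheerio",
) -> dict[str, Any]:
    """Build an input object using the field names from the example input."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    if output_format not in {"vector-ready", "hierarchical"}:
        raise ValueError(f"unknown output format: {output_format}")
    if crawler_type not in {"cheerio", "playwright"}:
        raise ValueError(f"unknown crawler type: {crawler_type}")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "outputFormat": output_format,
        "crawlerType": crawler_type,
    }


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    # Assumption: results are stored in the run's default dataset.
    return client.dataset(run["defaultDatasetId"]).list_items().items


run_input = build_run_input(["https://docs.example.com/"], max_pages=500)
print(run_input)
```

Calling `run_actor(token, actor_id, run_input)` with real credentials would then return the stored items. Validating the input locally first catches mistakes such as an empty URL list before a paid run is started.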
Documentation
RAG Knowledge Loader scrapes documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts them into vector-ready JSON for RAG applications.

## Features

- Crawls entire documentation sites recursively
- Extracts clean, structured content
- Removes navigation, headers, and footers automatically
- Outputs vector-ready JSON
- Supports GitBook, ReadTheDocs, Notion, and custom doc sites

## Use Cases

- Build "Chat with Docs" chatbots
- Feed LLMs with up-to-date documentation
- Create knowledge bases for RAG pipelines
- Automate documentation updates for vector databases

## Input Parameters

### Required

- Start URLs (required): Array of documentation site URLs to scrape
  - Example: https://docs.apify.com/, https://your-gitbook-site.com

### Optional Configuration

- Max pages to crawl (default: 1000): Maximum number of pages to scrape
  - Minimum: 1
- Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
  - Example: ["**/api/**", "**/guides/**"]
- Exclude URL patterns (globs) (default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these patterns
- Content CSS Selectors (default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for the main content area
- Remove CSS Selectors (default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Comma-separated CSS selectors for elements to strip, such as navigation and headers
- Output Format (default: "vector-ready"):
  - "vector-ready": Flat structure optimized for embeddings
  - "hierarchical": Nested structure with full metadata
- Crawler Type (default: "cheerio"):
  - "cheerio": Fast HTTP crawler for static sites
  - "playwright": Browser-based crawler for JavaScript-heavy sites

### Example Input

    {
      "startUrls": [
        { "url": "https://docs.example.com/" },
        { "url": "https://your-gitbook.com/docs" }
      ],
      "maxPages": 500,
      "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
      "includeUrlGlobs": ["**/docs/**"],
      "contentSelectors": "article, main, .markdown-body",
      "removeSelectors": "nav, footer, .sidebar",
      "outputFormat": "vector-ready",
      "crawlerType": "cheerio"
    }

### Minimal Input Example

    {
      "startUrls": [
        { "url": "https://docs.example.com/" }
      ]
    }

## Output Format

### Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

    {
      "metadata": {
        "crawledAt": "2025-12-06T08:11:00.000Z",
        "totalPages": 150,
        "startUrls": ["https://docs.example.com/"],
        "readyForEmbedding": true
      },
      "documents": [
        {
          "id": "unique-doc-id-123",
          "text": "Full page content with all text extracted and cleaned...",
          "metadata": {
            "source": "https://docs.example.com/page",
            "title": "Page Title",
            "url": "https://docs.example.com/page",
            "scrapedAt": "2025-12-06T08:11:00.000Z",
            "wordCount": 1234
          }
        }
      ]
    }

### Hierarchical Format

Includes the full document structure with headings and metadata:

    {
      "metadata": {
        "crawledAt": "2025-12-06T08:11:00.000Z",
        "totalPages": 150,
        "startUrls": ["https://docs.example.com/"]
      },
      "documents": [
        {
          "id": "unique-doc-id-123",
          "url": "https://docs.example.com/page",
          "title": "Page Title",
          "content": "Full page content...",
          "metadata": {
            "description": "Page meta description",
            "keywords": "api, documentation",
            "scrapedAt": "2025-12-06T08:11:00.000Z",
            "headings": [
              { "level": 1, "text": "Introduction" },
              { "level": 2, "text": "Getting Started" }
            ],
            "wordCount": 1234,
            "characterCount": 5678
          }
        }
      ]
    }

## Integration with Vector Databases

The output is ready to use with popular RAG frameworks:

- LangChain: Use JSONLoader to load documents
- LlamaIndex: Import as Document objects
- Pinecone/Weaviate: Batch upsert with metadata
- Chroma: Add to collection with embeddings
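Before handing the output to any of these tools, it usually helps to flatten it into parallel lists of IDs, texts, and metadata, and to send them in batches. The following is a minimal sketch under stated assumptions: the payload has the vector-ready shape shown in the Output Format section, and `to_records` and `in_batches` are hypothetical helper names, not part of the actor or of any vector database client. The sample payload is a trimmed copy of the documented example.

```python
import json
from typing import Any, Iterator


def to_records(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten vector-ready output into {id, text, metadata} records."""
    records = []
    for doc in payload.get("documents", []):
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # skip pages where no text was extracted
        records.append(
            {"id": doc["id"], "text": text, "metadata": dict(doc.get("metadata", {}))}
        )
    return records


def in_batches(items: list[Any], size: int = 100) -> Iterator[list[Any]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Trimmed copy of the documented vector-ready example.
sample = json.loads("""
{
  "metadata": {"totalPages": 150, "readyForEmbedding": true},
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "wordCount": 1234
      }
    }
  ]
}
""")

records = to_records(sample)
for batch in in_batches(records, size=100):
    ids = [r["id"] for r in batch]
    texts = [r["text"] for r in batch]
    metadatas = [r["metadata"] for r in batch]
    # An upsert call for the chosen vector database would go here.
    print(len(batch), ids)
```

The three lists line up with the shape most vector store clients expect for a bulk insert. For example, Chroma's `collection.add` accepts `ids`, `documents`, and `metadatas`. Long pages may still need to be split into smaller chunks before embedding, which this sketch does not do.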
Actor Information
- Developer: botflowtech
- Pricing: Paid
- Total Runs: 22
- Active Users: 2