RAG Knowledge Loader

Author: botflowtech

22 runs

2 users

About RAG Knowledge Loader

What does this actor do?

RAG Knowledge Loader is a web scraping actor that runs on the Apify platform. It crawls documentation sites (GitBook, ReadTheDocs, public Notion pages) and converts their content into vector-ready JSON for RAG applications.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
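
As a rough illustration of the "configure the input parameters" step, the sketch below assembles an input object in Python and checks it before a run. The key names (`startUrls`, `maxPages`, `crawlerType`, and so on) are taken from the Example Input JSON in the Documentation section, and the defaults from the parameter list there. The pairing of each key with a documented default is an assumption, and `build_input` is a hypothetical helper, not part of the actor or of Apify.

```python
import json

# Defaults as listed under "Optional Configuration" in the Documentation
# section. Mapping them onto these JSON keys is an assumption based on the
# Example Input JSON.
DEFAULTS = {
    "maxPages": 1000,
    "includeUrlGlobs": [],
    "excludeUrlGlobs": ["**/*.pdf", "**/.zip", "/login", "/signup"],
    "contentSelectors": "article, main, .content, .markdown-body, #content, [role='main']",
    "removeSelectors": "nav, header, footer, .sidebar, #sidebar, .navigation, "
                       ".cookie-banner, script, style, iframe",
    "outputFormat": "vector-ready",
    "crawlerType": "cheerio",
}

# Values the documentation lists for the two enumerated options.
ALLOWED = {
    "outputFormat": {"vector-ready", "hierarchical"},
    "crawlerType": {"cheerio", "playwright"},
}


def build_input(start_urls, **overrides):
    """Hypothetical helper: merge overrides into the defaults and validate."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if config["maxPages"] < 1:
        raise ValueError("maxPages must be at least 1")
    for key, allowed in ALLOWED.items():
        if config[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    return {"startUrls": [{"url": url} for url in start_urls], **config}


# Example: a JavaScript-heavy site, capped at 500 pages.
actor_input = build_input(
    ["https://docs.example.com/"], maxPages=500, crawlerType="playwright"
)
print(json.dumps(actor_input, indent=2))
```

The printed JSON can be pasted into the actor's input editor. Spelling out the defaults is optional, since the Minimal Input Example in the Documentation section needs only `startUrls`.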

Documentation

RAG Knowledge Loader scrapes documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts them into vector-ready JSON format for RAG applications.

## Features

- Crawls entire documentation sites recursively
- Extracts clean, structured content
- Removes navigation, headers, footers automatically
- Outputs vector-ready JSON format
- Supports GitBook, ReadTheDocs, Notion, and custom doc sites

## Use Cases

- Build "Chat with Docs" chatbots
- Feed LLMs with up-to-date documentation
- Create knowledge bases for RAG pipelines
- Automated documentation updates for vector databases

## Input Parameters

### Required

- Start URLs (required): Array of documentation site URLs to scrape
  - Example: `https://docs.apify.com/`, `https://your-gitbook-site.com`

### Optional Configuration

- Max pages to crawl (default: 1000): Maximum number of pages to scrape
  - Minimum: 1
- Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
  - Example: `["/api/", "/guides/"]`
- Exclude URL patterns (globs) (default: `["**/*.pdf", "**/.zip", "/login", "/signup"]`): Skip URLs matching these patterns
- Content CSS Selectors (default: `"article, main, .content, .markdown-body, #content, [role='main']"`): Comma-separated CSS selectors for main content area
- Remove CSS Selectors (default: `"nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"`): Selectors for elements to remove like navigation and headers
- Output Format (default: `"vector-ready"`):
  - `"vector-ready"`: Flat structure optimized for embeddings
  - `"hierarchical"`: Nested structure with full metadata
- Crawler Type (default: `"cheerio"`):
  - `"cheerio"`: Fast HTTP crawler for static sites
  - `"playwright"`: Browser-based crawler for JavaScript-heavy sites

### Example Input JSON

    {
      "startUrls": [
        { "url": "https://docs.example.com/" },
        { "url": "https://your-gitbook.com/docs" }
      ],
      "maxPages": 500,
      "excludeUrlGlobs": ["/.pdf", "/login", "/signup"],
      "includeUrlGlobs": ["/docs/"],
      "contentSelectors": "article, main, .markdown-body",
      "removeSelectors": "nav, footer, .sidebar",
      "outputFormat": "vector-ready",
      "crawlerType": "cheerio"
    }

### Minimal Input Example

    {
      "startUrls": [
        { "url": "https://docs.example.com/" }
      ]
    }

## Output Format

### Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

    {
      "metadata": {
        "crawledAt": "2025-12-06T08:11:00.000Z",
        "totalPages": 150,
        "startUrls": ["https://docs.example.com/"],
        "readyForEmbedding": true
      },
      "documents": [
        {
          "id": "unique-doc-id-123",
          "text": "Full page content with all text extracted and cleaned...",
          "metadata": {
            "source": "https://docs.example.com/page",
            "title": "Page Title",
            "url": "https://docs.example.com/page",
            "scrapedAt": "2025-12-06T08:11:00.000Z",
            "wordCount": 1234
          }
        }
      ]
    }

### Hierarchical Format

Includes full document structure with headings and metadata:

    {
      "metadata": {
        "crawledAt": "2025-12-06T08:11:00.000Z",
        "totalPages": 150,
        "startUrls": ["https://docs.example.com/"]
      },
      "documents": [
        {
          "id": "unique-doc-id-123",
          "url": "https://docs.example.com/page",
          "title": "Page Title",
          "content": "Full page content...",
          "metadata": {
            "description": "Page meta description",
            "keywords": "api, documentation",
            "scrapedAt": "2025-12-06T08:11:00.000Z",
            "headings": [
              { "level": 1, "text": "Introduction" },
              { "level": 2, "text": "Getting Started" }
            ],
            "wordCount": 1234,
            "characterCount": 5678
          }
        }
      ]
    }

## Integration with Vector Databases

The output is ready to use with popular RAG frameworks:

- LangChain: Use JSONLoader to load documents
- LlamaIndex: Import as Document objects
- Pinecone/Weaviate: Batch upsert with metadata
- Chroma: Add to collection with embeddings

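To make the "batch upsert with metadata" idea from the Documentation section concrete, here is a minimal Python sketch that reads a vector-ready output object, skips empty documents, and groups the rest into fixed-size batches. It uses only the standard library. The sample data is trimmed from the Vector-Ready Format example, the helper names `iter_records` and `in_batches` are hypothetical, and the actual upsert call depends on whichever vector store client you use, so it is left as a placeholder.

```python
import json
from typing import Iterable, Iterator, List

# Trimmed copy of the documented Vector-Ready Format example.
SAMPLE_OUTPUT = json.loads("""
{
  "metadata": {"totalPages": 150, "readyForEmbedding": true},
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "wordCount": 1234
      }
    }
  ]
}
""")


def iter_records(output: dict) -> Iterator[dict]:
    """Yield one flat record (id, text, metadata) per non-empty document."""
    for doc in output.get("documents", []):
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # nothing to embed for this page
        yield {"id": doc["id"], "text": text, "metadata": doc.get("metadata", {})}


def in_batches(items: Iterable, size: int) -> Iterator[List]:
    """Group items into lists of at most `size` elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


for batch in in_batches(iter_records(SAMPLE_OUTPUT), size=100):
    # Placeholder: embed the texts and pass the batch to your vector store here.
    print(len(batch), [record["id"] for record in batch])
```

A batch size of 100 is an arbitrary choice for illustration; the right value depends on the limits of the embedding service and vector database in use.
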
Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try RAG Knowledge Loader now on Apify. Free tier available with no credit card required.

Actor Information

Developer: botflowtech
Pricing: Paid
Total Runs: 22
Active Users: 2

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
