Docs To Rag

by gabrielaxy

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.


About Docs To Rag


What does this actor do?

Docs To Rag is a web scraping and automation tool that runs in the cloud on the Apify platform. It crawls documentation websites and converts their pages into RAG-ready chunks, applying semantic understanding and quality scoring along the way, and can feed the results directly into a vector database.
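
To make "RAG-ready chunks" concrete, the sketch below shows the kind of record such a pipeline typically emits. Every field name here is an illustrative assumption, not the Actor's documented output schema:

```typescript
// Hypothetical shape of one RAG-ready chunk; inspect a sample run's
// dataset to see the Actor's actual output schema.
interface DocChunk {
    text: string;            // chunk content, sized for an embedding model
    sourceUrl: string;       // documentation page the chunk came from
    heading: string | null;  // nearest section heading, kept for context
    qualityScore: number;    // quality score used to filter low-value chunks
}
```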

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications (see the sketch after this list)
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation
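
For the API route, here is a minimal sketch using the official apify-client package for Node.js. The Actor ID gabrielaxy/docs-to-rag and the startUrls input field are assumptions; confirm both against the Actor's page and input schema before use:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
// The Actor ID and input fields below are assumptions, not documented values.
const run = await client.actor('gabrielaxy/docs-to-rag').call({
    startUrls: [{ url: 'https://docs.example.com' }],
});

// Download the resulting chunks from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Retrieved ${items.length} items`);
```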

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results (both can also be done with a single API call, as sketched below)
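
If you would rather skip the client library, steps 3 and 4 can be collapsed into a single HTTP call to Apify's run-sync-get-dataset-items endpoint, which starts a run, waits for it, and returns the dataset items in the response. As above, the Actor ID and input fields are assumptions:

```typescript
// One-shot run over the Apify REST API (Actor IDs use ~ instead of / in URLs).
const res = await fetch(
    'https://api.apify.com/v2/acts/gabrielaxy~docs-to-rag/run-sync-get-dataset-items'
        + `?token=${process.env.APIFY_TOKEN}`,
    {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // Hypothetical input; check the Actor's input schema.
        body: JSON.stringify({ startUrls: [{ url: 'https://docs.example.com' }] }),
    },
);
const chunks = await res.json();
```

Note that the run-sync endpoints are intended for short runs; for large documentation sites, start the run asynchronously and fetch the dataset once it finishes.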

Documentation

# Website Health Monitor - Apify Actor Reference

A comprehensive reference implementation demonstrating all major Apify Actor features using CheerioCrawler. This Actor monitors website health by checking URLs for status codes, load times, and broken links.

## Purpose

This template serves as a copy-paste reference for building Apify Actors. It demonstrates every major Apify feature in a single, working Actor that you can use as a starting point for your own projects.

## Features Demonstrated

### 1. Actor Lifecycle

```typescript
// main.ts - Actor.main() handles init/exit automatically
Actor.main(async () => {
    // Your Actor code here
    // Actor.init() called automatically at start
    // Actor.exit() called automatically at end
});
```

### 2. Input Handling

```typescript
// Get typed input from Actor.getInput()
const rawInput = await Actor.getInput<ActorInput>();
const input = validateInput(rawInput);
```

Input Schema (input_schema.json):

- urls (array, required) - URLs to monitor
- maxConcurrency (integer, default: 5) - Concurrent requests
- proxyConfig (object) - Apify proxy configuration
- notifyOnFailure (boolean) - Enable failure notifications
- notificationActorId (string) - Actor to call on failures
- webhookUrl (string) - Webhook to trigger on completion

### 3. Storage - Dataset

```typescript
// Push results to dataset during crawling
await Dataset.pushData(healthCheckResult);

// Access dataset info
const dataset = await Actor.openDataset();
const info = await dataset.getInfo();
```

Dataset Output Schema:

```typescript
{
    url: string;
    status: number;
    loadTime: number;
    pageTitle: string | null;
    brokenLinks: string[];
    totalLinks: number;
    isHealthy: boolean;
    errorMessage: string | null;
    timestamp: string;
}
```

### 4. Storage - Key-Value Store

```typescript
// Open default Key-Value Store
const kvStore = await Actor.openKeyValueStore();

// Read INPUT (alternative to Actor.getInput())
const input = await kvStore.getValue('INPUT');

// Write OUTPUT summary
await kvStore.setValue('OUTPUT', summary);

// Write file with content type
await kvStore.setValue('SCREENSHOT_STATUS', content, {
    contentType: 'application/json',
});
```

### 5. Storage - Request Queue

```typescript
// Open request queue
const requestQueue = await Actor.openRequestQueue();

// Add requests with user data
await requestQueue.addRequest({
    url: 'https://example.com',
    userData: { originalUrl: url, startTime: Date.now() },
});
```

### 6. Crawlee Integration (CheerioCrawler)

```typescript
import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

// Create router with handlers
const router = createCheerioRouter();
router.addDefaultHandler(async ({ request, response, $, log }) => {
    // Extract data using Cheerio
    const title = $('title').text();
    const links = $('a[href]').map((_, el) => $(el).attr('href')).get();

    // Push to dataset
    await Dataset.pushData({ url: request.url, title, links });
});

// Create crawler
const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: router,
    // Note: the error is passed as the second argument, not on the context
    failedRequestHandler: async ({ request }, error) => {
        // Handle failed requests
    },
    maxConcurrency: 5,
    maxRequestRetries: 3,
});

// Run crawler
await crawler.run();
```

### 7. Proxy Configuration

```typescript
// Create proxy from input configuration
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

// Use in crawler
const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ...
});
```
### 8. Actor-to-Actor Communication

```typescript
// Call another Actor and wait for the result
const run = await Actor.call(
    'apify/send-email', // Actor ID
    {
        // Input for the called Actor
        subject: 'Alert',
        message: 'Something happened',
    },
    {
        memory: 256, // Memory in MB
        timeout: 60, // Timeout in seconds
    },
);

// Start an Actor without waiting (fire and forget)
const startedRun = await Actor.start('apify/some-actor', input);

// Call a saved Actor task and wait for it to finish
const taskRun = await Actor.callTask('user/my-task', input);
```

### 9. Logging

```typescript
// Different log levels
Actor.log.debug('Detailed debug info');
Actor.log.info('General information');
Actor.log.warning('Non-critical warning');
Actor.log.error('Error occurred', { error: err.message });

// Log with structured data
Actor.log.info('Processing URL', {
    url: 'https://example.com',
    status: 200,
    loadTime: 150,
});
```

### 10. Status Messages

```typescript
// Update Actor status (visible in Apify Console)
await Actor.setStatusMessage('Processing URL 5/10...');
await Actor.setStatusMessage('✓ Completed successfully');
```

### 11. Environment Information

```typescript
// Get Actor environment variables
const env = Actor.getEnv();
console.log({
    actorId: env.actorId,
    actorRunId: env.actorRunId,
    userId: env.userId,
    memoryMbytes: env.memoryMbytes,
    isAtHome: env.isAtHome,
    defaultDatasetId: env.defaultDatasetId,
    defaultKeyValueStoreId: env.defaultKeyValueStoreId,
    startedAt: env.startedAt,
    timeoutAt: env.timeoutAt,
});
```

### 12. Graceful Shutdown

```typescript
// Handle Actor migration (server change)
Actor.on('migrating', async () => {
    // Save state before migration
    const kvStore = await Actor.openKeyValueStore();
    await kvStore.setValue('MIGRATION_STATE', currentState);
});

// Handle Actor abort
Actor.on('aborting', async () => {
    // Save partial results
    await Dataset.pushData(partialResults);
});

// Other events: 'persistState', 'systemInfo'
```
### 13. Standby Mode (HTTP Server)

```typescript
import http from 'node:http';

// Create an HTTP server for Standby mode
const server = http.createServer((req, res) => {
    const url = new URL(req.url || '/', `http://${req.headers.host}`);
    if (url.pathname === '/health') {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({
            status: 'running',
            urlsProcessed: 10,
            memoryUsageMB: 128,
        }));
    } else {
        res.writeHead(404);
        res.end('Not found');
    }
});

// In Standby mode the platform supplies the port to listen on
server.listen(Number(process.env.ACTOR_STANDBY_PORT));
```

## File Structure

```
templates/cheerio-reference/
├── src/
│   ├── main.ts              # Entry point with Actor.main()
│   ├── routes.ts            # Crawlee router handlers
│   ├── types.ts             # TypeScript interfaces
│   └── utils.ts             # Helper functions
├── package.json             # Dependencies
├── tsconfig.json            # TypeScript config
├── Dockerfile               # Multi-stage build
├── .actor/
│   ├── actor.json           # Actor metadata
│   └── input_schema.json    # Input schema
└── README.md                # This file
```

## Running Locally

```bash
# Install dependencies
npm install

# Build TypeScript
npm run build

# Run with test input
echo '{"urls": ["https://example.com"]}' | npx apify-cli run -p
```

## Deploying to Apify

```bash
# Login to Apify
npx apify-cli login

# Push to Apify platform
npx apify-cli push
```

## Output

### Dataset Items

Each checked URL produces a dataset item:

```json
{
  "url": "https://example.com",
  "status": 200,
  "loadTime": 523,
  "pageTitle": "Example Domain",
  "brokenLinks": [],
  "totalLinks": 15,
  "isHealthy": true,
  "errorMessage": null,
  "timestamp": "2024-01-15T10:30:00.000Z"
}
```

### Key-Value Store OUTPUT

Summary of the health check run:

```json
{
  "totalChecked": 10,
  "failedCount": 2,
  "successCount": 8,
  "avgLoadTime": 450,
  "totalBrokenLinks": 5,
  "failedUrls": ["https://broken.example.com"],
  "startTime": "2024-01-15T10:00:00.000Z",
  "endTime": "2024-01-15T10:05:00.000Z",
  "durationSeconds": 300
}
```

## Quick Reference: Common Patterns

### Reading Files from Key-Value Store

```typescript
const kvStore = await Actor.openKeyValueStore();
const data = await kvStore.getValue('MY_DATA');
```

### Writing Binary Files

```typescript
await kvStore.setValue('image.png', buffer, {
    contentType: 'image/png',
});
```

### Named Stores

```typescript
// Open named stores (persist across runs)
const kvStore = await Actor.openKeyValueStore('my-store');
const dataset = await Actor.openDataset('my-dataset');
const queue = await Actor.openRequestQueue('my-queue');
```

### Metamorph (Transform Actor)

```typescript
// Transform into another Actor
await Actor.metamorph('apify/web-scraper', newInput);
```

### Abort Run

```typescript
// Abort with a status message
await Actor.fail('Critical error occurred');

// Exit successfully
await Actor.exit('Completed');
```

## License

ISC

Common Use Cases

  • Market Research - Gather competitive intelligence and market data
  • Lead Generation - Extract contact information for sales outreach
  • Price Monitoring - Track competitor pricing and product changes
  • Content Aggregation - Collect and organize content from multiple sources

Ready to Get Started?

Try Docs To Rag now on Apify. A free tier is available, and no credit card is required.


Actor Information

  • Developer: gabrielaxy
  • Pricing: Paid
  • Total Runs: 8
  • Active Users: 2
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
