Docs To Rag

by gabrielaxy

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.


About Docs To Rag


What does this actor do?

Docs To Rag is a web scraping and automation tool that runs in the cloud on the Apify platform. It crawls documentation websites and converts their pages into RAG-ready chunks, applying semantic understanding and quality scoring along the way, and can feed the results directly into a vector database.
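
To make "RAG-ready chunks" concrete, the sketch below shows the kind of record such a pipeline typically emits. Every field name here is an illustrative assumption, not the Actor's documented output schema:

```typescript
// Hypothetical shape of one RAG-ready chunk; inspect a sample run's
// dataset to see the Actor's actual output schema.
interface DocChunk {
    text: string;            // chunk content, sized for an embedding model
    sourceUrl: string;       // documentation page the chunk came from
    heading: string | null;  // nearest section heading, kept for context
    qualityScore: number;    // quality score used to filter low-value chunks
}
```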

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications (see the sketch after this list)
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation
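
For the API route, here is a minimal sketch using the official apify-client package for Node.js. The Actor ID gabrielaxy/docs-to-rag and the startUrls input field are assumptions; confirm both against the Actor's page and input schema before use:

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
// The Actor ID and input fields below are assumptions, not documented values.
const run = await client.actor('gabrielaxy/docs-to-rag').call({
    startUrls: [{ url: 'https://docs.example.com' }],
});

// Download the resulting chunks from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Retrieved ${items.length} items`);
```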

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results (both can also be done with a single API call, as sketched below)
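
If you would rather skip the client library, steps 3 and 4 can be collapsed into a single HTTP call to Apify's run-sync-get-dataset-items endpoint, which starts a run, waits for it, and returns the dataset items in the response. As above, the Actor ID and input fields are assumptions:

```typescript
// One-shot run over the Apify REST API (Actor IDs use ~ instead of / in URLs).
const res = await fetch(
    'https://api.apify.com/v2/acts/gabrielaxy~docs-to-rag/run-sync-get-dataset-items'
        + `?token=${process.env.APIFY_TOKEN}`,
    {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // Hypothetical input; check the Actor's input schema.
        body: JSON.stringify({ startUrls: [{ url: 'https://docs.example.com' }] }),
    },
);
const chunks = await res.json();
```

Note that the run-sync endpoints are intended for short runs; for large documentation sites, start the run asynchronously and fetch the dataset once it finishes.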

Documentation

# Website Health Monitor - Apify Actor Reference

A comprehensive reference implementation demonstrating all major Apify Actor features using CheerioCrawler. This Actor monitors website health by checking URLs for status codes, load times, and broken links.

## Purpose

This template serves as a copy-paste reference for building Apify Actors. It demonstrates every major Apify feature in a single, working Actor that you can use as a starting point for your own projects.

## Features Demonstrated

### 1. Actor Lifecycle

```typescript
// main.ts - Actor.main() handles init/exit automatically
Actor.main(async () => {
    // Your Actor code here
    // Actor.init() called automatically at start
    // Actor.exit() called automatically at end
});
```

### 2. Input Handling

```typescript
// Get typed input from Actor.getInput()
const rawInput = await Actor.getInput<ActorInput>();
const input = validateInput(rawInput);
```

Input Schema (input_schema.json):

- urls (array, required) - URLs to monitor
- maxConcurrency (integer, default: 5) - Concurrent requests
- proxyConfig (object) - Apify proxy configuration
- notifyOnFailure (boolean) - Enable failure notifications
- notificationActorId (string) - Actor to call on failures
- webhookUrl (string) - Webhook to trigger on completion

### 3. Storage - Dataset

```typescript
// Push results to dataset during crawling
await Dataset.pushData(healthCheckResult);

// Access dataset info
const dataset = await Actor.openDataset();
const info = await dataset.getInfo();
```

Dataset Output Schema:

```typescript
{
    url: string;
    status: number;
    loadTime: number;
    pageTitle: string | null;
    brokenLinks: string[];
    totalLinks: number;
    isHealthy: boolean;
    errorMessage: string | null;
    timestamp: string;
}
```

### 4. Storage - Key-Value Store

```typescript
// Open default Key-Value Store
const kvStore = await Actor.openKeyValueStore();

// Read INPUT (alternative to Actor.getInput())
const input = await kvStore.getValue('INPUT');

// Write OUTPUT summary
await kvStore.setValue('OUTPUT', summary);

// Write file with content type
await kvStore.setValue('SCREENSHOT_STATUS', content, {
    contentType: 'application/json',
});
```

### 5. Storage - Request Queue

```typescript
// Open request queue
const requestQueue = await Actor.openRequestQueue();

// Add requests with user data
await requestQueue.addRequest({
    url: 'https://example.com',
    userData: { originalUrl: url, startTime: Date.now() },
});
```

### 6. Crawlee Integration (CheerioCrawler)

```typescript
import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

// Create router with handlers
const router = createCheerioRouter();
router.addDefaultHandler(async ({ request, response, $, log }) => {
    // Extract data using Cheerio
    const title = $('title').text();
    const links = $('a[href]').map((_, el) => $(el).attr('href')).get();

    // Push to dataset
    await Dataset.pushData({ url: request.url, title, links });
});

// Create crawler
const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: router,
    // Note: the error is passed as the second argument, not on the context
    failedRequestHandler: async ({ request }, error) => {
        // Handle failed requests
    },
    maxConcurrency: 5,
    maxRequestRetries: 3,
});

// Run crawler
await crawler.run();
```

### 7. Proxy Configuration

```typescript
// Create proxy from input configuration
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

// Use in crawler
const crawler = new CheerioCrawler({
    proxyConfiguration,
    // ...
});
```
### 8. Actor-to-Actor Communication

```typescript
// Call another Actor and wait for the result
const run = await Actor.call(
    'apify/send-email', // Actor ID
    {
        // Input for the called Actor
        subject: 'Alert',
        message: 'Something happened',
    },
    {
        memory: 256, // Memory in MB
        timeout: 60, // Timeout in seconds
    },
);

// Start an Actor without waiting (fire and forget)
const startedRun = await Actor.start('apify/some-actor', input);

// Call a saved Actor task and wait for it to finish
const taskRun = await Actor.callTask('user/my-task', input);
```

### 9. Logging

```typescript
// Different log levels
Actor.log.debug('Detailed debug info');
Actor.log.info('General information');
Actor.log.warning('Non-critical warning');
Actor.log.error('Error occurred', { error: err.message });

// Log with structured data
Actor.log.info('Processing URL', {
    url: 'https://example.com',
    status: 200,
    loadTime: 150,
});
```

### 10. Status Messages

```typescript
// Update Actor status (visible in Apify Console)
await Actor.setStatusMessage('Processing URL 5/10...');
await Actor.setStatusMessage('✓ Completed successfully');
```

### 11. Environment Information

```typescript
// Get Actor environment variables
const env = Actor.getEnv();
console.log({
    actorId: env.actorId,
    actorRunId: env.actorRunId,
    userId: env.userId,
    memoryMbytes: env.memoryMbytes,
    isAtHome: env.isAtHome,
    defaultDatasetId: env.defaultDatasetId,
    defaultKeyValueStoreId: env.defaultKeyValueStoreId,
    startedAt: env.startedAt,
    timeoutAt: env.timeoutAt,
});
```

### 12. Graceful Shutdown

```typescript
// Handle Actor migration (server change)
Actor.on('migrating', async () => {
    // Save state before migration
    const kvStore = await Actor.openKeyValueStore();
    await kvStore.setValue('MIGRATION_STATE', currentState);
});

// Handle Actor abort
Actor.on('aborting', async () => {
    // Save partial results
    await Dataset.pushData(partialResults);
});

// Other events: 'persistState', 'systemInfo'
```
### 13. Standby Mode (HTTP Server)

```typescript
import http from 'node:http';

// Create an HTTP server for Standby mode
const server = http.createServer((req, res) => {
    const url = new URL(req.url || '/', `http://${req.headers.host}`);
    if (url.pathname === '/health') {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({
            status: 'running',
            urlsProcessed: 10,
            memoryUsageMB: 128,
        }));
    } else {
        res.writeHead(404);
        res.end('Not found');
    }
});

// In Standby mode the platform supplies the port to listen on
server.listen(Number(process.env.ACTOR_STANDBY_PORT));
```

## File Structure

```
templates/cheerio-reference/
├── src/
│   ├── main.ts              # Entry point with Actor.main()
│   ├── routes.ts            # Crawlee router handlers
│   ├── types.ts             # TypeScript interfaces
│   └── utils.ts             # Helper functions
├── package.json             # Dependencies
├── tsconfig.json            # TypeScript config
├── Dockerfile               # Multi-stage build
├── .actor/
│   ├── actor.json           # Actor metadata
│   └── input_schema.json    # Input schema
└── README.md                # This file
```

## Running Locally

```bash
# Install dependencies
npm install

# Build TypeScript
npm run build

# Run with test input
echo '{"urls": ["https://example.com"]}' | npx apify-cli run -p
```

## Deploying to Apify

```bash
# Login to Apify
npx apify-cli login

# Push to Apify platform
npx apify-cli push
```

## Output

### Dataset Items

Each checked URL produces a dataset item:

```json
{
  "url": "https://example.com",
  "status": 200,
  "loadTime": 523,
  "pageTitle": "Example Domain",
  "brokenLinks": [],
  "totalLinks": 15,
  "isHealthy": true,
  "errorMessage": null,
  "timestamp": "2024-01-15T10:30:00.000Z"
}
```

### Key-Value Store OUTPUT

Summary of the health check run:

```json
{
  "totalChecked": 10,
  "failedCount": 2,
  "successCount": 8,
  "avgLoadTime": 450,
  "totalBrokenLinks": 5,
  "failedUrls": ["https://broken.example.com"],
  "startTime": "2024-01-15T10:00:00.000Z",
  "endTime": "2024-01-15T10:05:00.000Z",
  "durationSeconds": 300
}
```

## Quick Reference: Common Patterns

### Reading Files from Key-Value Store

```typescript
const kvStore = await Actor.openKeyValueStore();
const data = await kvStore.getValue('MY_DATA');
```

### Writing Binary Files

```typescript
await kvStore.setValue('image.png', buffer, {
    contentType: 'image/png',
});
```

### Named Stores

```typescript
// Open named stores (persist across runs)
const kvStore = await Actor.openKeyValueStore('my-store');
const dataset = await Actor.openDataset('my-dataset');
const queue = await Actor.openRequestQueue('my-queue');
```

### Metamorph (Transform Actor)

```typescript
// Transform into another Actor
await Actor.metamorph('apify/web-scraper', newInput);
```

### Abort Run

```typescript
// Abort with a status message
await Actor.fail('Critical error occurred');

// Exit successfully
await Actor.exit('Completed');
```

## License

ISC

Common Use Cases

  • Market Research - Gather competitive intelligence and market data
  • Lead Generation - Extract contact information for sales outreach
  • Price Monitoring - Track competitor pricing and product changes
  • Content Aggregation - Collect and organize content from multiple sources

Ready to Get Started?

Try Docs To Rag now on Apify. A free tier is available, and no credit card is required.


Actor Information

  • Developer: gabrielaxy
  • Pricing: Paid
  • Total Runs: 8
  • Active Users: 2
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
