LLMScraper
by ohlava
Find the best scraper for your website and the data you need.
About LLMScraper
What does this actor do?
LLMScraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
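The same steps can also be driven programmatically with the Apify API client instead of the console. A minimal sketch follows; the actor ID `ohlava/llmscraper` is a guess based on this page (check the actual ID on Apify before use), and the input keys are taken from the documentation below:

```python
def build_run_input(target_url: str, extraction_goal: str, claude_api_key: str) -> dict:
    """Assemble the actor input using the keys documented on this page."""
    return {
        "targetUrl": target_url,
        "extractionGoal": extraction_goal,
        "claudeApiKey": claude_api_key,
        "maxActorAttempts": 5,
        "maxTimeMinutes": 20,
    }

def run_scraper(apify_token: str, run_input: dict) -> list:
    """Start a run and collect the dataset items (requires `pip install apify-client`)."""
    # Imported lazily so the sketch can be read without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(apify_token)
    # Hypothetical actor ID -- verify it on the actor's Apify page.
    run = client.actor("ohlava/llmscraper").call(run_input=run_input)
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())

if __name__ == "__main__":
    items = run_scraper(
        "apify_api_...",  # your Apify API token
        build_run_input(
            "https://example.com/products/",
            "Extract product titles and prices",
            "sk-ant-...",
        ),
    )
    print(len(items))
```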
Documentation
# 🤖 LLM-Powered Web Scraper

An intelligent Apify Actor that uses Claude AI to automatically discover, test, and select the best Apify actors for your web scraping tasks. No manual configuration needed!

## ✨ Features

- 🧠 **AI-Powered Actor Discovery**: Uses Claude AI to automatically find and test the best Apify actors for your target website
- 🔄 **Smart Retry Logic**: Automatically adjusts parameters and retries failed attempts with different actors
- 📊 **Quality Assessment**: Evaluates scraped data quality across multiple dimensions (completeness, relevance, structure, volume)
- 🎯 **Priority-Based Testing**: Tests domain-specific actors first, then falls back to general-purpose ones
- 📈 **Real-time Progress**: Tracks and reports scraping progress with detailed logging
- 🔗 **MCP Integration**: Connects to the Apify MCP Server for dynamic actor discovery and execution
- ⚙️ **Flexible Configuration**: Extensive customization options for timeout, quality thresholds, and model selection

## 🚀 Quick Start

1. Set up your Claude API key in the Actor input or as an environment variable
2. Provide your target URL and describe what data you want to extract
3. Run the Actor - it will automatically find and test the best scraping approach

### Example Input

```json
{
  "targetUrl": "https://example-ecommerce.com/products/",
  "extractionGoal": "Extract product information including title, price, description, and availability",
  "claudeApiKey": "sk-ant-api03-...",
  "maxActorAttempts": 5,
  "maxTimeMinutes": 20
}
```

## 📝 Input Configuration

### Required Fields

- `targetUrl`: The URL of the website you want to scrape
- `extractionGoal`: A description of the data you want to extract from the website
- `claudeApiKey`: Your Anthropic Claude API key for AI-powered analysis

### Optional Configuration

- `maxActorAttempts` (default: 10): Maximum number of different actors to try
- `maxRetriesPerActor` (default: 3): Maximum retry attempts per actor
- `maxTimeMinutes` (default: 30): Maximum total execution time in minutes
- `modelName` (default: `"claude-3-5-haiku-latest"`): Claude model to use
- `debugMode` (default: false): Enable detailed logging
- `preferSpecificActors` (default: true): Prioritize domain-specific actors
- `minDataQualityScore` (default: 70): Minimum quality score (0-100) to accept results
- `enableProxy` (default: true): Use a proxy for scraping requests

### Available Claude Models

- `claude-3-5-haiku-latest` - Fast and cost-effective (recommended)
- `claude-3-5-sonnet-latest` - Balanced performance and quality
- `claude-3-opus-latest` - Maximum quality (slower, more expensive)

## 📊 Output

The Actor saves results to:

### Dataset

Each scraped item with metadata:

```json
{
  "url": "https://example.com",
  "data": {...},
  "quality_score": 0.85,
  "actor_used": "apify/web-scraper",
  "timestamp": "2025-07-24T11:30:00Z",
  "success": true,
  "extraction_goal": "Extract product information",
  "total_execution_time": 45.2,
  "attempts_made": 3
}
```

### Key-Value Store

Summary information in `SCRAPING_RESULT`:

```json
{
  "success": true,
  "quality_score": 0.85,
  "items_count": 25,
  "best_actor_id": "apify/web-scraper",
  "total_execution_time": 45.2,
  "attempts_made": 3,
  "progress_updates": [...],
  "actor_attempts": [...]
}
```

## 🔧 How It Works

1. **Actor Discovery**: Connects to the Apify MCP Server to discover available actors
2. **AI Analysis**: Uses Claude to analyze the target website and select appropriate actors
3. **Smart Testing**: Tests actors in priority order with intelligent parameter adjustment
4. **Quality Evaluation**: Assesses data quality using multiple metrics
5. **Retry Logic**: Automatically retries with different parameters if needed
6. **Result Selection**: Returns the best results based on quality scores

## 🏗️ Architecture

The Actor consists of several key components:

- **MCP Client** (`src/llmscraper/mcp/`): Handles communication with the Apify MCP Server
- **Claude Manager** (`src/llmscraper/claude/`): Manages AI conversations and tool calls
- **LLM Scraper Actor** (`src/llmscraper/llm_scraper/`): Main orchestration logic
- **Retry Logic** (`src/llmscraper/llm_scraper/retry_logic.py`): Intelligent parameter adjustment
- **Quality Evaluator** (`src/llmscraper/llm_scraper/quality_evaluator.py`): Data quality assessment

## 🔑 Environment Variables

- `ANTHROPIC_API_KEY`: Your Anthropic Claude API key (alternative to the input field)
- `APIFY_TOKEN`: Automatically provided by the Apify platform
- `MCP_SERVER_URL`: Custom MCP server URL (optional)

## ⚡ Performance Tips

1. **Use the Haiku model**: For most tasks, `claude-3-5-haiku-latest` provides the best speed/cost ratio
2. **Adjust attempts**: Reduce `maxActorAttempts` for faster results, increase it for better coverage
3. **Quality threshold**: Lower `minDataQualityScore` if you're getting no results
4. **Time limits**: Set an appropriate `maxTimeMinutes` based on your needs

## 🛠️ Development

### Local Testing

```bash
# Install dependencies (using a virtual environment)
pip install -r requirements.txt
# Or, if you have the project's virtual environment:
./venv/bin/pip install -r requirements.txt

# Set up the environment
export ANTHROPIC_API_KEY=your_key_here

# Run the actor locally
python3 main.py

# Or using npm scripts:
npm run start        # Uses system python3
npm run start:local  # Uses the project virtual environment
```

### Project Structure

```text
LLMScraper/
├── main.py                      # Actor entry point
├── src/llmscraper/
│   ├── mcp/                     # MCP client implementation
│   ├── claude/                  # Claude AI integration
│   ├── llm_scraper/             # Main scraper logic
│   │   ├── actor.py             # Main LLMScraperActor class
│   │   ├── models.py            # Input/output models
│   │   ├── retry_logic.py       # Intelligent retry logic
│   │   └── quality_evaluator.py # Data quality assessment
│   ├── scraping/                # Apify actor integrations
│   └── utils/                   # Configuration and utilities
├── .actor/
│   ├── actor.json               # Actor metadata
│   ├── input_schema.json        # Input validation schema
│   └── README.md                # This file
├── Dockerfile                   # Container configuration
├── requirements.txt             # Python dependencies
├── package.json                 # Node.js metadata
└── pyproject.toml               # Python packaging configuration
```

## 📚 API Reference

### Main Function

```python
from llmscraper.llm_scraper import LLMScraperActor, LLMScraperInput

# Create the configuration
config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-..."
)

# Run the scraper
scraper = LLMScraperActor(config)
result = await scraper.run(progress_callback=None)
```

### Configuration

```python
from llmscraper.llm_scraper.models import LLMScraperInput

config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
    max_actor_attempts=10,
    max_retries_per_actor=3,
    max_time_minutes=30,
    model_name="claude-3-5-haiku-latest",
    debug_mode=False,
    prefer_specific_actors=True,
    min_data_quality_score=0.7,  # Note: the API expects 0.0-1.0; the input form uses 0-100
    enable_proxy=True
)
```

## 📄 License

MIT License - see the LICENSE file for details.

## 🆘 Support & Troubleshooting

### Common Issues

- **API key issues**: Ensure your Claude API key is valid and has sufficient credits
- **No results found**: Try reducing `minDataQualityScore` or increasing `maxActorAttempts`
- **Timeout errors**: Increase `maxTimeMinutes` for complex websites
- **Quality score too low**: Make your `extractionGoal` more specific

### Debugging

- Enable `debugMode: true` for detailed logging
- Check the Actor logs for step-by-step execution details
- Verify the target URL is accessible and returns content
- Monitor the progress updates in the key-value store

### Performance Optimization

- Use `claude-3-5-haiku-latest` for faster, cost-effective processing
- Set `maxActorAttempts` based on your time/quality requirements
- Enable `preferSpecificActors` to prioritize domain-specific solutions
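The "Quality Evaluation" step mentioned in the documentation scores results across completeness, relevance, structure, and volume. The real metrics live in `quality_evaluator.py`; the sketch below only illustrates the general idea, and the individual checks, the equal weighting, and the item cap are my assumptions:

```python
from typing import Optional

def score_items(items: list[dict], expected_fields: set[str]) -> float:
    """Illustrative quality score in 0.0-1.0, averaging four assumed dimensions."""
    if not items:
        return 0.0
    # Completeness: fraction of expected fields filled in, averaged over items.
    completeness = sum(
        sum(1 for f in expected_fields if item.get(f) not in (None, ""))
        / max(len(expected_fields), 1)
        for item in items
    ) / len(items)
    # Structure: fraction of items sharing the same set of keys as the first item.
    first_keys = set(items[0])
    structure = sum(1 for item in items if set(item) == first_keys) / len(items)
    # Volume: saturates once at least 20 items were scraped (arbitrary cap).
    volume = min(len(items) / 20, 1.0)
    # Relevance would need the extraction goal (e.g. an LLM judgment); stubbed here.
    relevance = 1.0
    return (completeness + structure + volume + relevance) / 4
```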
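Similarly, the "Smart Retry Logic" described above adjusts parameters between failed attempts. The actual rules are in `retry_logic.py`; the fragment below is only a hypothetical illustration of that pattern, with made-up adjustment rules:

```python
from typing import Optional

def adjust_params(run_input: dict, attempt: int, last_error: Optional[str]) -> dict:
    """Illustrative parameter adjustment between retries (rules are assumptions,
    not the actor's actual logic)."""
    adjusted = dict(run_input)
    if last_error and "blocked" in last_error.lower():
        # After an apparent block, retry behind a proxy.
        adjusted["enableProxy"] = True
    if attempt >= 2:
        # Later attempts get a larger time budget.
        adjusted["maxTimeMinutes"] = run_input.get("maxTimeMinutes", 30) + 10
    return adjusted
```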
Common Use Cases
- **Market Research**: Gather competitive intelligence and market data
- **Lead Generation**: Extract contact information for sales outreach
- **Price Monitoring**: Track competitor pricing and product changes
- **Content Aggregation**: Collect and organize content from multiple sources
Ready to Get Started?
Try LLMScraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: ohlava
- Pricing: Paid
- Total Runs: 13
- Active Users: 3