LLMScraper
by ohlava
Find the best scraper for your website and the data you need.
About LLMScraper
What does this actor do?
LLMScraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
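The same steps can also be driven programmatically with the Apify API client instead of the console. A minimal sketch follows; the actor ID `ohlava/llmscraper` is a guess based on this page (check the actual ID on Apify before use), and the input keys are taken from the documentation below:

```python
def build_run_input(target_url: str, extraction_goal: str, claude_api_key: str) -> dict:
    """Assemble the actor input using the keys documented on this page."""
    return {
        "targetUrl": target_url,
        "extractionGoal": extraction_goal,
        "claudeApiKey": claude_api_key,
        "maxActorAttempts": 5,
        "maxTimeMinutes": 20,
    }

def run_scraper(apify_token: str, run_input: dict) -> list:
    """Start a run and collect the dataset items (requires `pip install apify-client`)."""
    # Imported lazily so the sketch can be read without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(apify_token)
    # Hypothetical actor ID -- verify it on the actor's Apify page.
    run = client.actor("ohlava/llmscraper").call(run_input=run_input)
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())

if __name__ == "__main__":
    items = run_scraper(
        "apify_api_...",  # your Apify API token
        build_run_input(
            "https://example.com/products/",
            "Extract product titles and prices",
            "sk-ant-...",
        ),
    )
    print(len(items))
```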
Documentation
# 🤖 LLM-Powered Web Scraper

An intelligent Apify Actor that uses Claude AI to automatically discover, test, and select the best Apify actors for your web scraping tasks. No manual configuration needed!

## ✨ Features

- 🧠 **AI-Powered Actor Discovery**: Uses Claude AI to automatically find and test the best Apify actors for your target website
- 🔄 **Smart Retry Logic**: Automatically adjusts parameters and retries failed attempts with different actors
- 📊 **Quality Assessment**: Evaluates scraped data quality across multiple dimensions (completeness, relevance, structure, volume)
- 🎯 **Priority-Based Testing**: Tests domain-specific actors first, then falls back to general-purpose ones
- 📈 **Real-time Progress**: Tracks and reports scraping progress with detailed logging
- 🔗 **MCP Integration**: Connects to the Apify MCP Server for dynamic actor discovery and execution
- ⚙️ **Flexible Configuration**: Extensive customization options for timeout, quality thresholds, and model selection

## 🚀 Quick Start

1. Set up your Claude API key in the Actor input or as an environment variable
2. Provide your target URL and describe what data you want to extract
3. Run the Actor - it will automatically find and test the best scraping approach

### Example Input

```json
{
  "targetUrl": "https://example-ecommerce.com/products/",
  "extractionGoal": "Extract product information including title, price, description, and availability",
  "claudeApiKey": "sk-ant-api03-...",
  "maxActorAttempts": 5,
  "maxTimeMinutes": 20
}
```

## 📝 Input Configuration

### Required Fields

- `targetUrl`: The URL of the website you want to scrape
- `extractionGoal`: A description of the data you want to extract from the website
- `claudeApiKey`: Your Anthropic Claude API key for AI-powered analysis

### Optional Configuration

- `maxActorAttempts` (default: 10): Maximum number of different actors to try
- `maxRetriesPerActor` (default: 3): Maximum retry attempts per actor
- `maxTimeMinutes` (default: 30): Maximum total execution time in minutes
- `modelName` (default: `"claude-3-5-haiku-latest"`): Claude model to use
- `debugMode` (default: false): Enable detailed logging
- `preferSpecificActors` (default: true): Prioritize domain-specific actors
- `minDataQualityScore` (default: 70): Minimum quality score (0-100) to accept results
- `enableProxy` (default: true): Use a proxy for scraping requests

### Available Claude Models

- `claude-3-5-haiku-latest` - Fast and cost-effective (recommended)
- `claude-3-5-sonnet-latest` - Balanced performance and quality
- `claude-3-opus-latest` - Maximum quality (slower, more expensive)

## 📊 Output

The Actor saves results to:

### Dataset

Each scraped item with metadata:

```json
{
  "url": "https://example.com",
  "data": {...},
  "quality_score": 0.85,
  "actor_used": "apify/web-scraper",
  "timestamp": "2025-07-24T11:30:00Z",
  "success": true,
  "extraction_goal": "Extract product information",
  "total_execution_time": 45.2,
  "attempts_made": 3
}
```

### Key-Value Store

Summary information in `SCRAPING_RESULT`:

```json
{
  "success": true,
  "quality_score": 0.85,
  "items_count": 25,
  "best_actor_id": "apify/web-scraper",
  "total_execution_time": 45.2,
  "attempts_made": 3,
  "progress_updates": [...],
  "actor_attempts": [...]
}
```

## 🔧 How It Works

1. **Actor Discovery**: Connects to the Apify MCP Server to discover available actors
2. **AI Analysis**: Uses Claude to analyze the target website and select appropriate actors
3. **Smart Testing**: Tests actors in priority order with intelligent parameter adjustment
4. **Quality Evaluation**: Assesses data quality using multiple metrics
5. **Retry Logic**: Automatically retries with different parameters if needed
6. **Result Selection**: Returns the best results based on quality scores

## 🏗️ Architecture

The Actor consists of several key components:

- **MCP Client** (`src/llmscraper/mcp/`): Handles communication with the Apify MCP Server
- **Claude Manager** (`src/llmscraper/claude/`): Manages AI conversations and tool calls
- **LLM Scraper Actor** (`src/llmscraper/llm_scraper/`): Main orchestration logic
- **Retry Logic** (`src/llmscraper/llm_scraper/retry_logic.py`): Intelligent parameter adjustment
- **Quality Evaluator** (`src/llmscraper/llm_scraper/quality_evaluator.py`): Data quality assessment

## 🔑 Environment Variables

- `ANTHROPIC_API_KEY`: Your Anthropic Claude API key (alternative to the input field)
- `APIFY_TOKEN`: Automatically provided by the Apify platform
- `MCP_SERVER_URL`: Custom MCP server URL (optional)

## ⚡ Performance Tips

1. **Use the Haiku model**: For most tasks, `claude-3-5-haiku-latest` provides the best speed/cost ratio
2. **Adjust attempts**: Reduce `maxActorAttempts` for faster results, increase it for better coverage
3. **Quality threshold**: Lower `minDataQualityScore` if you're getting no results
4. **Time limits**: Set an appropriate `maxTimeMinutes` based on your needs

## 🛠️ Development

### Local Testing

```bash
# Install dependencies (using a virtual environment)
pip install -r requirements.txt
# Or, if you have the project's virtual environment:
./venv/bin/pip install -r requirements.txt

# Set up the environment
export ANTHROPIC_API_KEY=your_key_here

# Run the actor locally
python3 main.py

# Or using npm scripts:
npm run start        # Uses system python3
npm run start:local  # Uses the project virtual environment
```

### Project Structure

```text
LLMScraper/
├── main.py                      # Actor entry point
├── src/llmscraper/
│   ├── mcp/                     # MCP client implementation
│   ├── claude/                  # Claude AI integration
│   ├── llm_scraper/             # Main scraper logic
│   │   ├── actor.py             # Main LLMScraperActor class
│   │   ├── models.py            # Input/output models
│   │   ├── retry_logic.py       # Intelligent retry logic
│   │   └── quality_evaluator.py # Data quality assessment
│   ├── scraping/                # Apify actor integrations
│   └── utils/                   # Configuration and utilities
├── .actor/
│   ├── actor.json               # Actor metadata
│   ├── input_schema.json        # Input validation schema
│   └── README.md                # This file
├── Dockerfile                   # Container configuration
├── requirements.txt             # Python dependencies
├── package.json                 # Node.js metadata
└── pyproject.toml               # Python packaging configuration
```

## 📚 API Reference

### Main Function

```python
from llmscraper.llm_scraper import LLMScraperActor, LLMScraperInput

# Create the configuration
config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-..."
)

# Run the scraper
scraper = LLMScraperActor(config)
result = await scraper.run(progress_callback=None)
```

### Configuration

```python
from llmscraper.llm_scraper.models import LLMScraperInput

config = LLMScraperInput(
    target_url="https://example-website.com",
    extraction_goal="Extract product data",
    anthropic_api_key="sk-ant-...",
    max_actor_attempts=10,
    max_retries_per_actor=3,
    max_time_minutes=30,
    model_name="claude-3-5-haiku-latest",
    debug_mode=False,
    prefer_specific_actors=True,
    min_data_quality_score=0.7,  # Note: the API expects 0.0-1.0; the input form uses 0-100
    enable_proxy=True
)
```

## 📄 License

MIT License - see the LICENSE file for details.

## 🆘 Support & Troubleshooting

### Common Issues

- **API key issues**: Ensure your Claude API key is valid and has sufficient credits
- **No results found**: Try reducing `minDataQualityScore` or increasing `maxActorAttempts`
- **Timeout errors**: Increase `maxTimeMinutes` for complex websites
- **Quality score too low**: Make your `extractionGoal` more specific

### Debugging

- Enable `debugMode: true` for detailed logging
- Check the Actor logs for step-by-step execution details
- Verify the target URL is accessible and returns content
- Monitor the progress updates in the key-value store

### Performance Optimization

- Use `claude-3-5-haiku-latest` for faster, cost-effective processing
- Set `maxActorAttempts` based on your time/quality requirements
- Enable `preferSpecificActors` to prioritize domain-specific solutions
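The "Quality Evaluation" step mentioned in the documentation scores results across completeness, relevance, structure, and volume. The real metrics live in `quality_evaluator.py`; the sketch below only illustrates the general idea, and the individual checks, the equal weighting, and the item cap are my assumptions:

```python
from typing import Optional

def score_items(items: list[dict], expected_fields: set[str]) -> float:
    """Illustrative quality score in 0.0-1.0, averaging four assumed dimensions."""
    if not items:
        return 0.0
    # Completeness: fraction of expected fields filled in, averaged over items.
    completeness = sum(
        sum(1 for f in expected_fields if item.get(f) not in (None, ""))
        / max(len(expected_fields), 1)
        for item in items
    ) / len(items)
    # Structure: fraction of items sharing the same set of keys as the first item.
    first_keys = set(items[0])
    structure = sum(1 for item in items if set(item) == first_keys) / len(items)
    # Volume: saturates once at least 20 items were scraped (arbitrary cap).
    volume = min(len(items) / 20, 1.0)
    # Relevance would need the extraction goal (e.g. an LLM judgment); stubbed here.
    relevance = 1.0
    return (completeness + structure + volume + relevance) / 4
```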
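Similarly, the "Smart Retry Logic" described above adjusts parameters between failed attempts. The actual rules are in `retry_logic.py`; the fragment below is only a hypothetical illustration of that pattern, with made-up adjustment rules:

```python
from typing import Optional

def adjust_params(run_input: dict, attempt: int, last_error: Optional[str]) -> dict:
    """Illustrative parameter adjustment between retries (rules are assumptions,
    not the actor's actual logic)."""
    adjusted = dict(run_input)
    if last_error and "blocked" in last_error.lower():
        # After an apparent block, retry behind a proxy.
        adjusted["enableProxy"] = True
    if attempt >= 2:
        # Later attempts get a larger time budget.
        adjusted["maxTimeMinutes"] = run_input.get("maxTimeMinutes", 30) + 10
    return adjusted
```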
Common Use Cases
- **Market Research**: Gather competitive intelligence and market data
- **Lead Generation**: Extract contact information for sales outreach
- **Price Monitoring**: Track competitor pricing and product changes
- **Content Aggregation**: Collect and organize content from multiple sources
Ready to Get Started?
Try LLMScraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: ohlava
- Pricing: Paid
- Total Runs: 13
- Active Users: 3