Enhanced Deep Content Crawler

by assertive_analogy

A fast, Python-powered web crawler with smart content extraction, JS support, metadata capture, and duplicate detection. Ideal for SEO, content migration, and e-commerce scraping. Reliable, scalable, and easy to customize.

406 runs
25 users

About Enhanced Deep Content Crawler

A fast, Python-powered web crawler with smart content extraction, JS support, metadata capture, and duplicate detection. Ideal for SEO, content migration, and e-commerce scraping. Reliable, scalable, and easy to customize.

What does this actor do?

Enhanced Deep Content Crawler is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
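Runs can also be triggered programmatically instead of through the console. Before calling the API, it helps to assemble and sanity-check the run input against the documented schema limits. The sketch below is illustrative only (the `build_run_input` helper is hypothetical, not part of the actor); the resulting dict would be passed as `run_input` to Apify's Python client.

```python
import json

# Documented input-schema limits from the actor's configuration table.
LIMITS = {"maxPages": (1, 1000), "maxDepth": (1, 10)}

def build_run_input(base_url, max_pages=100, max_depth=3, use_browser=True):
    """Assemble and range-check the actor's run input (hypothetical helper)."""
    for name, value in (("maxPages", max_pages), ("maxDepth", max_depth)):
        lo, hi = LIMITS[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name} must be between {lo} and {hi}")
    return {
        "baseUrl": base_url,
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "useBrowser": use_browser,
    }

print(json.dumps(build_run_input("https://example.com", max_pages=50)))
```

The same dict works when pasted into the actor's input editor on the console.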

Documentation

# Enhanced Deep Content Crawler

A powerful, production-ready web crawler for comprehensive content extraction, built with modern Python technologies. This enhanced crawler combines the efficiency of Crawlee for Python with advanced content extraction capabilities, intelligent duplicate detection, and robust error handling.

### 🚀 Key Features

- **Dual Crawling Modes**: Choose between Playwright (JavaScript support) or HTTP-only crawling for optimal performance
- **Smart Content Extraction**: Automatic main content detection with fallback strategies
- **Comprehensive Metadata**: Extracts titles, descriptions, Open Graph data, Twitter Cards, and JSON-LD structured data
- **Duplicate Detection**: Content-based deduplication using hashing algorithms
- **Advanced Link Discovery**: Intelligent internal link extraction with domain validation
- **Real-time Statistics**: Live progress tracking with performance metrics
- **Robust Error Handling**: Retry mechanisms with exponential backoff
- **Flexible Configuration**: Extensive customization options via input schema

## Included features

- **Apify SDK** - a toolkit for building Apify Actors in Python.
- **Crawlee for Python** - a web scraping and browser automation library.
- **Input schema** - define and validate a schema for your Actor's input.
- **Request queue** - manage the URLs you want to scrape in a queue.
- **Dataset** - store and access structured data extracted from web pages.
- **Beautiful Soup** - a library for pulling data out of HTML and XML files.
- **Playwright** - modern browser automation for JavaScript-heavy sites.
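The content-based deduplication mentioned in the feature list can be sketched with a hash of normalized page text. This is a minimal illustration, not the actor's actual implementation; the whitespace/case normalization shown here is an assumption.

```python
import hashlib

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivially reformatted pages hash the same.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(text: str) -> bool:
    """Return True if equivalent content was already seen during this crawl."""
    digest = content_hash(text)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Storing only digests keeps the seen-set small even on large crawls, at the cost of not being able to recover which earlier page matched.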
### 📊 Data Extraction Capabilities

The enhanced crawler extracts comprehensive data from each page:

#### Basic Information

- **URL**: Complete page URL with timestamp
- **Title**: Page title from the `<title>` tag
- **Description**: Meta description and other metadata
- **Word Count**: Total words in main content
- **Page Size**: HTML document size in bytes

#### Advanced Metadata

- **Open Graph Data**: Complete OG tags for social sharing
- **Twitter Card Data**: Twitter-specific metadata
- **Structured Data**: JSON-LD schemas for rich snippets
- **Canonical URLs**: Preferred page URLs
- **Author Information**: When available in meta tags

#### Content Analysis

- **Main Content**: Intelligently extracted article/page content
- **Custom Content**: User-defined CSS selector extraction
- **Content Hash**: For duplicate detection
- **Internal Links**: All same-domain links with metadata
- **Images**: Up to 10 images with alt text and metadata

### ⚡ Performance Features

- **Dual Mode Operation**:
  - **Browser Mode (Playwright)**: Full JavaScript support, perfect for SPAs
  - **HTTP Mode**: Lightning-fast for static content
- **Concurrent Processing**: Asynchronous crawling with proper queue management
- **Smart Deduplication**: Avoids crawling identical content
- **Progress Tracking**: Real-time statistics and performance metrics
- **Memory Efficient**: Optimized for large-scale crawling

### 🛡️ Reliability & Error Handling

- **Retry Logic**: Configurable retry attempts with exponential backoff
- **Timeout Management**: Customizable request timeouts
- **Error Categorization**: Detailed error logging and reporting
- **Graceful Degradation**: Continues crawling despite individual page failures
- **URL Validation**: Prevents crawling invalid or dangerous URLs

## Resources

- Video introduction to the Python SDK
- Webinar introducing Crawlee for Python
- Apify Python SDK documentation
- Crawlee for Python documentation
- Python tutorials in the Apify Academy
- Integrations with Make, GitHub, Zapier, Google Drive, and other apps
- Video guide on getting scraped data using the Apify API
- A short guide on how to build web scrapers using code templates: web scraper template

## 🛠️ Configuration Options

### Basic Configuration

```json
{
  "baseUrl": "https://example.com",
  "maxPages": 100,
  "maxDepth": 3
}
```

### Advanced Configuration

```json
{
  "baseUrl": "https://example.com",
  "maxPages": 500,
  "maxDepth": 4,
  "contentSelector": "article, .post-content, .entry-content",
  "excludePatterns": [
    ".*\\.pdf$",
    "/admin/.*",
    ".*\\?.*utm_.*"
  ],
  "useBrowser": true,
  "maxRetries": 5,
  "requestTimeout": 45000
}
```

### Configuration Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `baseUrl` | string | required | Starting URL for crawling |
| `maxPages` | integer | 100 | Maximum pages to crawl (1-1000) |
| `maxDepth` | integer | 3 | Maximum link depth to follow (1-10) |
| `contentSelector` | string | `"body"` | CSS selector for content extraction |
| `excludePatterns` | array | `[]` | Regex patterns to exclude URLs |
| `useBrowser` | boolean | true | Enable JavaScript with Playwright |
| `maxRetries` | integer | 3 | Retry attempts for failed requests |
| `requestTimeout` | integer | 30000 | Request timeout in milliseconds |

## 🚀 Usage Examples

### E-commerce Site Crawling

```json
{
  "baseUrl": "https://shop.example.com",
  "maxPages": 200,
  "contentSelector": ".product-description, .product-specs",
  "excludePatterns": [
    "/cart.*",
    "/checkout.*",
    "/account.*"
  ]
}
```

### News Site Crawling

```json
{
  "baseUrl": "https://news.example.com",
  "maxPages": 1000,
  "maxDepth": 2,
  "contentSelector": "article, .post-content",
  "useBrowser": false,
  "excludePatterns": [
    "/tag/.*",
    "/author/.*",
    ".*\\?.*utm_.*"
  ]
}
```

### Documentation Site Crawling

```json
{
  "baseUrl": "https://docs.example.com",
  "maxPages": 300,
  "contentSelector": ".markdown-body, .content",
  "excludePatterns": [
    ".*\\.pdf$",
    ".*\\.zip$"
  ]
}
```

## 📊 Output Data Structure

Each crawled page produces a comprehensive data object:

```json
{
  "url": "https://example.com/page",
  "crawled_at": "2024-01-15T10:30:00Z",
  "content_hash": "a1b2c3d4e5f6",
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "keywords": "keyword1, keyword2",
    "author": "Author Name",
    "canonical_url": "https://example.com/canonical",
    "og_data": {
      "title": "Social Title",
      "description": "Social Description",
      "image": "https://example.com/image.jpg"
    },
    "structured_data": [
      { "@type": "Article", "headline": "Article Title" }
    ]
  },
  "main_content": "Extracted main content text...",
  "specific_content": "Content from custom selector...",
  "images": [
    {
      "src": "https://example.com/image.jpg",
      "alt": "Image description",
      "title": "Image title",
      "width": "800",
      "height": "600"
    }
  ],
  "internal_links": [
    {
      "url": "https://example.com/linked-page",
      "text": "Link text",
      "title": "Link title"
    }
  ],
  "word_count": 1250,
  "page_size": 45678,
  "load_time": 1.23
}
```

## Getting started

For complete information, see this article. In short, you will:

1. Build the Actor
2. Run the Actor

## Pull the Actor for local development

If you would like to develop locally, you can pull the existing Actor from the Apify console using the Apify CLI:

1. Install `apify-cli`:

   Using Homebrew:

   ```bash
   brew install apify-cli
   ```

   Using NPM:

   ```bash
   npm -g install apify-cli
   ```

2. Pull the Actor by its unique `<ActorId>`, which is one of the following:

   - the unique name of the Actor to pull (e.g. "apify/hello-world")
   - or the ID of the Actor to pull (e.g. "E2jjCZBezvAZnX8Rb")

   You can find both by clicking on the Actor title at the top of the page, which opens a modal containing both the Actor's unique name and its ID.

   This command will copy the Actor into the current directory on your local machine:

   ```bash
   apify pull <ActorId>
   ```

## Documentation reference

To learn more about Apify and Actors, take a look at the following resources:

- Apify SDK for JavaScript documentation
- Apify SDK for Python documentation
- Apify Platform documentation
- Join our developer community on Discord

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Enhanced Deep Content Crawler now on Apify. Free tier available with no credit card required.


Actor Information

Developer: assertive_analogy
Pricing: Paid
Total Runs: 406
Active Users: 25
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
