Website Extractor

Name: Website Extractor
Author: mikolabs

5 runs

2 users

About Website Extractor

Download complete websites as ZIP archives. Suitable for creating offline backups, archiving websites, or downloading entire sites with all assets and source code. Intended for research purposes.

What does this actor do?

Website Extractor is an Actor on the Apify platform that mirrors a website with HTTrack and packages the downloaded files, including source code, into a ZIP archive.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation
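
As a small illustration of the API access mentioned above, the sketch below assembles a download URL for a run's ZIP file using the Apify API's key-value store record endpoint (`/v2/key-value-stores/{storeId}/records/{recordKey}`). The function name `zip_record_url` and the sample store ID are hypothetical, and the assumption that the record key is the output name plus `.zip` comes from the output description in the Documentation section.

```python
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def zip_record_url(store_id: str, output_name: str) -> str:
    """Build the URL of the ZIP record in a key-value store (illustrative helper)."""
    if not store_id or not output_name:
        raise ValueError("store_id and output_name are required")
    # Assumption: the archive is stored under "<outputName>.zip".
    key = f"{output_name}.zip"
    # Percent-encode both path segments so unusual characters stay valid in a URL.
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


if __name__ == "__main__":
    print(zip_record_url("abc123", "example.com_20241205_130000"))
```

Downloading from that URL still requires whatever authentication the store is configured with; that part is left out of the sketch.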

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
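
For the "configure the input parameters" step, a minimal sketch of building and checking an input object before submitting it might look like the following. The defaults and the 1-10 depth range are taken from the parameter table in the Documentation section; the helper name `build_input` and the exact validation rules are assumptions, not the Actor's own code.

```python
from urllib.parse import urlparse

# Defaults copied from the optional-parameter table in the Documentation section.
DEFAULTS = {
    "depth": 2, "stayOnDomain": True, "externalDepth": 0, "connections": 4,
    "maxRate": 0, "maxSize": 0, "maxTime": 0, "retries": 2, "timeout": 30,
    "getImages": True, "getVideos": True, "followRobots": True,
    "outputName": None, "cleanup": True,
}


def build_input(url: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and run basic checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must include http:// or https://")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    config = {"url": url, **DEFAULTS, **overrides}
    if not 1 <= config["depth"] <= 10:
        raise ValueError("depth must be between 1 and 10")
    return config


if __name__ == "__main__":
    # HTML/CSS/JS-only scrape with more parallel connections.
    print(build_input("https://example.com", depth=3, getImages=False,
                      getVideos=False, connections=8))
```

Rejecting unknown keys early is a design choice of this sketch: it catches typos such as `getImage` before a run is started.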

Documentation

# Scrape Any Website with Source Code

Download complete websites and get them as ZIP archives. Perfect for creating offline backups, archiving websites, or downloading entire sites with all assets. Includes source code.

## Features

- ✅ Complete Website Downloads - Downloads entire websites with all assets and source code
- ✅ ZIP Archive Output - Automatically creates compressed ZIP files with full source code
- ✅ Configurable Depth - Control how deep to follow links (1-10 levels)
- ✅ Rate Limiting - Respect servers with configurable download rates
- ✅ Domain Filtering - Stay on same domain or follow external links
- ✅ Content Selection - Choose to download images, videos, or just HTML/CSS/JS
- ✅ Robots.txt Support - Optionally respect website's robots.txt
- ✅ Progress Tracking - Real-time logging of scraping progress
- ✅ Statistics - File counts, sizes, and compression ratios

## Input Configuration

### Required

- Website URL (`url`) - The URL to scrape (must include `http://` or `https://`)

### Optional

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `depth` | Integer | 2 | How many links deep to follow (1-10) |
| `stayOnDomain` | Boolean | true | Only download from the same domain |
| `externalDepth` | Integer | 0 | How deep to follow external links |
| `connections` | Integer | 4 | Number of simultaneous downloads |
| `maxRate` | Integer | 0 | Max download rate in KB/s (0 = unlimited) |
| `maxSize` | Integer | 0 | Max total size in MB (0 = unlimited) |
| `maxTime` | Integer | 0 | Max scraping time in seconds (0 = unlimited) |
| `retries` | Integer | 2 | Number of retry attempts on error |
| `timeout` | Integer | 30 | Connection timeout in seconds |
| `getImages` | Boolean | true | Download image files |
| `getVideos` | Boolean | true | Download video files |
| `followRobots` | Boolean | true | Respect robots.txt |
| `outputName` | String | null | Custom output name (auto-generated if empty) |
| `cleanup` | Boolean | true | Remove source files after creating ZIP |

## Output

The Actor provides two types of output:

### 1. Dataset

Statistics and metadata for each scrape:

```json
{
  "url": "https://example.com",
  "outputName": "example.com_20241205_130000",
  "zipFile": "example.com_20241205_130000.zip",
  "fileCount": 156,
  "totalSize": 5242880,
  "zipSize": 2621440,
  "compressionRatio": 50.0,
  "timestamp": "2024-12-05T13:00:00.000Z",
  "config": { ... },
  "status": "success"
}
```

### 2. Key-Value Store

The complete website as a ZIP archive. Access it via:

- Apify Console: Storage → Key-Value Store → [filename].zip
- API: `https://api.apify.com/v2/key-value-stores/{storeId}/records/{filename}.zip`

## Usage Examples

### Example 1: Basic Website Backup

```json
{
  "url": "https://example.com",
  "depth": 2,
  "stayOnDomain": true
}
```

Downloads the website up to 2 levels deep, staying on the same domain.

### Example 2: Deep Archive with External Links

```json
{
  "url": "https://example.com",
  "depth": 5,
  "externalDepth": 1,
  "stayOnDomain": false
}
```

Downloads 5 levels deep and follows external links 1 level.

### Example 3: Fast Scrape (HTML/CSS/JS Only)

```json
{
  "url": "https://example.com",
  "depth": 3,
  "getImages": false,
  "getVideos": false,
  "connections": 8
}
```

Fast scraping without images or videos, using 8 parallel connections.

### Example 4: Rate-Limited Polite Scrape

```json
{
  "url": "https://example.com",
  "depth": 2,
  "maxRate": 500,
  "connections": 2,
  "followRobots": true
}
```

Polite scraping with rate limiting and respecting robots.txt.

### Example 5: Time-Limited Scrape

```json
{
  "url": "https://example.com",
  "depth": 10,
  "maxTime": 300,
  "maxSize": 100
}
```

Stops after 5 minutes or 100 MB, whichever comes first.

## How It Works

1. Input Validation - Validates the URL and configuration
2. HTTrack Execution - Runs HTTrack with configured parameters to download website source code
3. Progress Monitoring - Logs progress in real-time
4. Pre-ZIP Cleanup - Removes HTTrack cache files and index files before archiving
5. ZIP Creation - Creates a compressed archive of all website files and source code
6. Storage - Saves ZIP to Key-Value Store and stats to Dataset
7. Post-ZIP Cleanup - Optionally removes temporary files after ZIP creation

## Technical Details

### Based On

- HTTrack 3.49+ - Industry-standard website copier
- Python 3.11 - Modern async Python runtime
- Apify SDK 2.7+ - For Actor integration and storage

### Limitations

- Some JavaScript-heavy SPAs may not download completely
- Websites with aggressive bot protection may block scraping
- Dynamic content loaded after page load may be missed
- Maximum recommended depth is 5-6 for most websites

### Performance

- Small websites (< 100 pages): 1-5 minutes
- Medium websites (100-1000 pages): 5-30 minutes
- Large websites (1000+ pages): 30+ minutes

Performance depends on:

- Website size and structure
- Number of connections
- Network speed
- Rate limiting settings

## Legal and Ethical Considerations

⚠️ Important: Always ensure you have permission to scrape websites.

- ✅ Respect `robots.txt` files (enabled by default)
- ✅ Don't overload servers (use rate limiting)
- ✅ Check website Terms of Service
- ✅ Don't scrape copyrighted content without permission
- ✅ Use reasonable connection limits (2-8)

## Troubleshooting

### Scraping Takes Too Long

- Reduce `depth` to 1 or 2
- Disable `getVideos` and `getImages`
- Increase `connections` (but be respectful)
- Set `maxTime` or `maxSize` limits

### ZIP File Too Large

- Reduce `depth`
- Disable `getVideos`
- Set `maxSize` limit
- Use `maxTime` to stop early

### Website Blocks Scraping

- Enable `followRobots`
- Reduce `connections` to 2-4
- Add rate limiting with `maxRate`
- Increase `timeout` if connections are slow

### Missing Content

- Increase `depth`
- Enable `externalDepth` if content is on other domains
- Check if website uses heavy JavaScript (may not work)
- Enable `getImages` and `getVideos` if needed

## Development

### Local Testing

```bash
# Install dependencies
pip install -r requirements.txt

# Run locally
apify run
```

### Building

```bash
# Build Docker image
docker build -t httrack-scraper .

# Run container
docker run httrack-scraper
```

## Support

For issues or questions:

- Check Actor logs for detailed error messages
- Review HTTrack documentation: https://www.httrack.com/
- Contact Apify support through the platform

## License

This Actor uses HTTrack, which is licensed under GPL v3.

## Version History

- 1.0 - Initial release with full HTTrack integration, source code download, and ZIP archive output
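
As a rough illustration of the "HTTrack Execution" step in How It Works, the sketch below translates an input object into an HTTrack command line. The Actor's real command is not shown in this documentation, so the mapping is an assumption throughout: the function name `httrack_args`, treating `stayOnDomain` as external depth 0, the KB/s and MB unit conversions, and the file-extension filters are all hypothetical choices. The flags themselves (`-r`, `-%e`, `-c`, `-R`, `-T`, `-s`, `-A`, `-M`, `-E`, `-O`) are standard HTTrack options.

```python
# Hypothetical extension lists used to build HTTrack exclusion filters.
IMAGE_EXTS = ("jpg", "jpeg", "png", "gif")
VIDEO_EXTS = ("mp4", "webm", "avi")


def httrack_args(config: dict, output_dir: str) -> list[str]:
    """Translate an Actor input dict into an HTTrack argument list (illustrative)."""
    args = ["httrack", config["url"], "-O", output_dir]
    args.append(f"-r{config.get('depth', 2)}")  # mirror depth
    # Assumption: staying on the domain means external link depth 0.
    ext_depth = 0 if config.get("stayOnDomain", True) else config.get("externalDepth", 0)
    args.append(f"-%e{ext_depth}")
    args.append(f"-c{config.get('connections', 4)}")  # simultaneous connections
    args.append(f"-R{config.get('retries', 2)}")      # retries on error
    args.append(f"-T{config.get('timeout', 30)}")     # timeout in seconds
    args.append("-s2" if config.get("followRobots", True) else "-s0")
    if config.get("maxRate", 0) > 0:
        # HTTrack expects bytes per second; assume 1 KB = 1000 bytes.
        args.append(f"-A{config['maxRate'] * 1000}")
    if config.get("maxSize", 0) > 0:
        # HTTrack expects bytes; assume 1 MB = 1024 * 1024 bytes.
        args.append(f"-M{config['maxSize'] * 1024 * 1024}")
    if config.get("maxTime", 0) > 0:
        args.append(f"-E{config['maxTime']}")         # max mirror time in seconds
    if not config.get("getImages", True):
        args += [f"-*.{ext}" for ext in IMAGE_EXTS]   # exclusion filters
    if not config.get("getVideos", True):
        args += [f"-*.{ext}" for ext in VIDEO_EXTS]
    return args


if __name__ == "__main__":
    polite = {"url": "https://example.com", "depth": 2, "maxRate": 500,
              "connections": 2, "followRobots": True}
    print(" ".join(httrack_args(polite, "/tmp/site")))
```

With HTTrack installed, the resulting list could be passed to `subprocess.run(..., check=True)`; returning a list rather than a shell string avoids quoting problems with unusual URLs.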

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Actor Information

Developer: mikolabs
Pricing: Paid
Total Runs: 5
Active Users: 2
