Website Backup

by mhamas

8,696 runs
306 users
Try This Actor

About Website Backup

Enables you to create a backup of any website by crawling it, so that you don't lose any content by accident. Ideal e.g. for your personal or company blog.

What does this actor do?

Website Backup is a web crawling and automation tool available on the Apify platform. It recursively crawls a website and saves an MHTML snapshot of every page to cloud storage, so you can keep point-in-time backups of your content without any local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
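If you'd rather trigger a backup from your own code than from the Apify console, a minimal sketch using the apify-client npm package could look like the following. The actor ID `mhamas/website-backup`, the `APIFY_TOKEN` environment variable, and the example input values are assumptions for illustration; check the actor page and its input schema for the exact ID, field names, and defaults (the parameters themselves are documented in the table below).

```javascript
// Minimal sketch: run the Website Backup actor via the Apify API.
// Assumes the apify-client package and an actor ID of 'mhamas/website-backup';
// verify both on the actor's Apify page before use.
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function runBackup() {
    // Field names follow the parameter table in the Documentation section below.
    // NOTE: the exact shape of the start URL entries (plain strings vs { url } objects)
    // should be checked against the actor's input schema.
    const input = {
        startURLs: [{ url: 'https://blog.apify.com/' }],
        linkSelector: 'a[href]',
        sameOrigin: true,
        maxRequestsPerCrawl: 100,
    };

    // call() starts the actor and waits for the run to finish.
    const run = await client.actor('mhamas/website-backup').call(input);
    console.log(`Run ${run.id} finished with status ${run.status}`);
    return run;
}

runBackup().catch((err) => {
    console.error(err);
    process.exit(1);
});
```

For long crawls you may prefer starting the run without waiting and letting a scheduled run or webhook handle the follow-up instead.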

Documentation

# Apify Actor - Website Backup

## Description

The purpose of this actor is to enable the creation of website backups by recursively crawling them. For example, we'd use it to make regular backups of https://blog.apify.com/ so that we don't lose any content by accident. Although such a backup cannot be restored automatically, it's better than losing the data completely.

Given URL entry points, the actor recursively crawls the links found on the pages using a provided CSS selector and creates a separate MHTML snapshot of each page. Each snapshot is taken after the full page is rendered with a Puppeteer crawler and includes all the content, such as images and CSS. Hence, it can be used on any HTML / JS / WordPress website that doesn't require authentication.

## Input parameters

| Field | Type | Description |
|---|---|---|
| startURLs | array | List of URL entry points. |
| linkSelector | string | CSS selector matching elements with 'href' attributes that should be enqueued. |
| maxRequestsPerCrawl | integer | The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value. If set to 0, there is no limit. |
| maxCrawlingDepth | integer | Defines how many links away from the start URLs the scraper will descend. 0 means unlimited. |
| maxConcurrency | integer | Defines how many pages can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. Use this option to set a hard limit. |
| customKeyValueStore | string | Use a custom named key-value store for saving results. If a key-value store with this name doesn't exist yet, it is created. The snapshots of the pages are saved in the key-value store. |
| customDataset | string | Use a custom named dataset for saving metadata. If a dataset with this name doesn't exist yet, it is created. The metadata about the snapshots of the pages is saved in the dataset. |
| proxyConfiguration | object | Choose to use no proxy, Apify Proxy, or provide custom proxy URLs. |
| sameOrigin | boolean | Only back up URLs with the same origin as any of the start URL origins. E.g. when turned on for a single start URL https://blog.apify.com, only links with the prefix https://blog.apify.com will be backed up recursively. |
| timeoutForSingleUrlInSeconds | integer | Timeout in seconds for backing up a single URL. Try to increase this timeout if you see the error `Error: handlePageFunction timed out after X seconds`. |
| navigationTimeoutInSeconds | integer | Timeout in seconds in which the navigation needs to finish. Try to increase this if you see the error `Navigation timeout of XXX ms exceeded`. |
| searchParamsToIgnore | array | Names of URL search parameters (such as 'source', 'sourceid', etc.) that should be ignored in the URLs when crawling. |

## Output

For each URL visited, a single zip file containing the MHTML snapshot and its metadata is stored in a key-value store (the default one, or a named one depending on the input). The key for each zip file includes a timestamp, a URL hash, and the URL in human-readable form. Note that the Apify platform only supports certain characters in keys and limits their length to 256 characters (which is why e.g. `/` is removed). Apart from the key-value store, metadata for the crawled webpages is also stored in a dataset (default or named).
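To get the backed-up pages out of the platform afterwards, the dataset and key-value store described above can be read with the same client. The sketch below is a rough illustration: it assumes the default (unnamed) stores of a finished run and the `run` object returned by the `call()` example above; if `customDataset` or `customKeyValueStore` were set in the input, open those stores by name instead. The on-disk file naming is illustrative only.

```javascript
// Sketch: list backed-up pages and download the MHTML snapshot zips of a finished run.
// Assumes 'run' is the object returned by client.actor(...).call(...) above.
const { ApifyClient } = require('apify-client');
const fs = require('fs');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function downloadSnapshots(run) {
    // Metadata about each crawled page lives in the run's dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Backed up ${items.length} pages`);

    // Each zip file (MHTML snapshot + metadata) lives in the key-value store.
    // listKeys() is paginated; for very large backups, loop using exclusiveStartKey.
    const store = client.keyValueStore(run.defaultKeyValueStoreId);
    const { items: keys } = await store.listKeys();
    for (const { key } of keys) {
        // The default store also holds the run's INPUT record; skip non-snapshot keys.
        if (key === 'INPUT') continue;
        const record = await store.getRecord(key, { buffer: true });
        fs.writeFileSync(`${key}.zip`, record.value);
    }
}
```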
## Compute unit consumption

An example run which did a backup of 323 webpages …