Universal Article Scraper
by universal_scraping
About Universal Article Scraper
Universal article scraper for news websites, blogs, etc. It can scrape articles from multiple websites simultaneously, including metadata such as title, content, publication date, image, and author.
What does this actor do?
Universal Article Scraper is a web scraping and automation tool available on the Apify platform. It crawls the websites you configure and extracts article content and metadata in the cloud, with no local setup required.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
1. Click "Try This Actor" to open it on Apify
2. Create a free Apify account if you don't have one
3. Configure the input parameters as needed
4. Run the actor and download your results
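Before starting a run, the actor input can be assembled programmatically. The sketch below builds the input object using the field names documented in the Input Configuration section (`topic`, `urls`, `patterns`, `ignoreUrls`, `maxRequestsPerCrawl`); the validation rules themselves are illustrative additions, not checks the actor is documented to perform.

```javascript
// Sketch: assemble and sanity-check the actor input before a run.
// Field names come from the actor's Input Configuration docs; the
// validation logic here is an illustrative assumption.
function buildInput(websites, maxRequestsPerCrawl = 100) {
  for (const site of websites) {
    if (typeof site.topic !== "string" || site.topic.length === 0) {
      throw new Error("each website needs a non-empty 'topic' string");
    }
    if (!Array.isArray(site.urls) || site.urls.length === 0) {
      throw new Error(`website '${site.topic}' needs at least one start URL`);
    }
  }
  return {
    websites: websites.map((site) => ({
      topic: site.topic,
      urls: site.urls,
      patterns: site.patterns ?? [],     // empty array = include everything
      ignoreUrls: site.ignoreUrls ?? [], // empty array = exclude nothing
    })),
    maxRequestsPerCrawl,
  };
}

const input = buildInput([
  { topic: "techcrunch", urls: ["https://techcrunch.com/"], patterns: ["**/2024/**"] },
]);
console.log(JSON.stringify(input.maxRequestsPerCrawl)); // 100
```

The resulting object can be pasted into the actor's input editor or passed to a run started through the Apify API.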
Documentation
# Universal Article Scraper

A powerful web scraper that can extract articles from multiple websites simultaneously. This scraper intelligently identifies and extracts article content, metadata, and structured data from news sites, blogs, and other content platforms.

## Features

- Multi-website scraping - process multiple websites in parallel
- Smart article detection - automatically identifies article content using various heuristics
- URL pattern filtering - include/exclude URLs based on patterns
- Proxy support - built-in proxy rotation for reliable scraping
- Structured output - extracts title, content, metadata, and publication details
- Rate limiting - configurable request limits to respect website policies
- Error handling - robust error handling with retry mechanisms

## How it works

The scraper processes multiple websites concurrently, following these steps for each site:

1. URL Discovery: starts from provided seed URLs and discovers article links
2. Content Extraction: uses Cheerio to parse HTML and extract article content
3. Data Structuring: formats extracted data into a consistent schema
4. Storage: saves results to an Apify dataset for easy access

Key components:

- Smart content detection: identifies main article content using semantic HTML tags and heuristics
- Metadata extraction: pulls publication dates, authors, categories, and other structured data
- URL filtering: respects include/exclude patterns to focus on relevant content
- Concurrent processing: handles multiple websites simultaneously for efficiency

## Input Configuration

The scraper accepts a JSON input with the following structure:

```json
{
  "websites": [
    {
      "topic": "techcrunch",
      "urls": ["https://techcrunch.com/"],
      "patterns": ["**/2024/**", "**/article/**"],
      "ignoreUrls": [
        "https://techcrunch.com/author*",
        "https://techcrunch.com/category*",
        "https://techcrunch.com/tag*"
      ]
    },
    {
      "topic": "bbc-news",
      "urls": ["https://www.bbc.com/news"],
      "patterns": ["**/news/**"],
      "ignoreUrls": ["**/live/**", "**/weather/**"]
    },
    {
      "topic": "theverge",
      "urls": ["https://www.theverge.com/"],
      "patterns": [],
      "ignoreUrls": []
    }
  ],
  "maxRequestsPerCrawl": 100
}
```

### Configuration Fields

#### websites (required)

An array of website objects to scrape. Each website object contains:

- topic (string, required): a unique identifier for the website (used for labeling results)
- urls (array, required): starting URLs to begin crawling from
- patterns (array, optional): URL patterns to include (glob patterns supported)
  - Example: `["**/article/**", "**/news/**"]` - only scrape URLs containing "/article/" or "/news/"
  - Leave empty (`[]`) to include all discovered URLs
- ignoreUrls (array, optional): URL patterns to exclude (glob patterns supported)
  - Example: `["**/author/**", "**/category/**"]` - skip author pages and category pages
  - Useful for avoiding non-article pages such as navigation and archives

#### maxRequestsPerCrawl (number, optional)

Maximum number of requests per website (default: 100). Controls how many pages are scraped from each website to prevent unbounded crawling.
### Output

Scraped articles are stored in the Apify dataset. Each article contains:

#### Core Fields

- url - source URL where the article was scraped from
- loadedUrl - final loaded URL (may differ from the original due to redirects)
- baseUrl - base URL of the website
- articleText - main article content (minimum 300 characters required)
- title - article headline
- topic - website topic identifier from the input configuration

#### Metadata Fields

- publishDate - publication date as a Date object (parsed from publishDateString)
- publishDateString - raw publication date string as found on the page
- modifiedDate - last modified date as a Date object (if available)
- author - author name
- description - article description/summary
- canonicalUrl - canonical URL specified by the page

#### Content Classification

- type - content type (e.g., "article")
- section - article section/category
- tags - array of article tags
- keywords - article keywords

#### Media & SEO

- imageUrl - featured image URL
- imageAlt - alt text for the featured image
- robots - robots meta tag value

Note: empty fields are automatically removed from the output. Articles shorter than 300 characters are filtered out.
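The output rules above (300-character minimum, empty-field removal, publishDate parsed from publishDateString) can be reproduced when post-processing downloaded dataset items. The helper below is illustrative, not part of the actor; only the thresholds and field names come from the documentation.

```javascript
// Sketch: clean a downloaded dataset item the way the Output section
// describes. The 300-character minimum, empty-field removal, and
// publishDateString -> publishDate parsing mirror the documented behavior;
// this helper itself is an illustrative assumption.
function cleanArticle(item) {
  if (!item.articleText || item.articleText.length < 300) return null;
  const cleaned = {};
  for (const [key, value] of Object.entries(item)) {
    const isEmpty =
      value == null ||
      value === "" ||
      (Array.isArray(value) && value.length === 0);
    if (!isEmpty) cleaned[key] = value;
  }
  if (cleaned.publishDateString) {
    const parsed = new Date(cleaned.publishDateString);
    if (!Number.isNaN(parsed.getTime())) cleaned.publishDate = parsed;
  }
  return cleaned;
}

const article = cleanArticle({
  url: "https://example.com/post",
  title: "Example",
  articleText: "x".repeat(400),
  publishDateString: "2024-05-01T10:00:00Z",
  tags: [],   // empty array -> removed
  author: "", // empty string -> removed
});
// article.publishDate is a Date; article.tags and article.author are absent
```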
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try Universal Article Scraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: universal_scraping
- Pricing: Paid
- Total Runs: 220
- Active Users: 33
Related Actors
Smart Article Extractor
by lukaskrivka
Google Search
by devisty
Twitter Tweets Scraper
by gentle_cloud
Twitter Profile
by danek
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.