Universal Article Scraper
by universal_scraping
About Universal Article Scraper
Universal article scraper for news websites, blogs, etc. It can scrape articles from multiple websites simultaneously, including metadata such as title, content, publication date, image, and author.
What does this actor do?
Universal Article Scraper is a web scraping and automation tool available on the Apify platform. It crawls the websites you configure and extracts article content and metadata in the cloud, with no local setup required.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
1. Click "Try This Actor" to open it on Apify
2. Create a free Apify account if you don't have one
3. Configure the input parameters as needed
4. Run the actor and download your results
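Before starting a run, the actor input can be assembled programmatically. The sketch below builds the input object using the field names documented in the Input Configuration section (`topic`, `urls`, `patterns`, `ignoreUrls`, `maxRequestsPerCrawl`); the validation rules themselves are illustrative additions, not checks the actor is documented to perform.

```javascript
// Sketch: assemble and sanity-check the actor input before a run.
// Field names come from the actor's Input Configuration docs; the
// validation logic here is an illustrative assumption.
function buildInput(websites, maxRequestsPerCrawl = 100) {
  for (const site of websites) {
    if (typeof site.topic !== "string" || site.topic.length === 0) {
      throw new Error("each website needs a non-empty 'topic' string");
    }
    if (!Array.isArray(site.urls) || site.urls.length === 0) {
      throw new Error(`website '${site.topic}' needs at least one start URL`);
    }
  }
  return {
    websites: websites.map((site) => ({
      topic: site.topic,
      urls: site.urls,
      patterns: site.patterns ?? [],     // empty array = include everything
      ignoreUrls: site.ignoreUrls ?? [], // empty array = exclude nothing
    })),
    maxRequestsPerCrawl,
  };
}

const input = buildInput([
  { topic: "techcrunch", urls: ["https://techcrunch.com/"], patterns: ["**/2024/**"] },
]);
console.log(JSON.stringify(input.maxRequestsPerCrawl)); // 100
```

The resulting object can be pasted into the actor's input editor or passed to a run started through the Apify API.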
Documentation
# Universal Article Scraper

A powerful web scraper that can extract articles from multiple websites simultaneously. This scraper intelligently identifies and extracts article content, metadata, and structured data from news sites, blogs, and other content platforms.

## Features

- Multi-website scraping - process multiple websites in parallel
- Smart article detection - automatically identifies article content using various heuristics
- URL pattern filtering - include/exclude URLs based on patterns
- Proxy support - built-in proxy rotation for reliable scraping
- Structured output - extracts title, content, metadata, and publication details
- Rate limiting - configurable request limits to respect website policies
- Error handling - robust error handling with retry mechanisms

## How it works

The scraper processes multiple websites concurrently, following these steps for each site:

1. URL Discovery: starts from provided seed URLs and discovers article links
2. Content Extraction: uses Cheerio to parse HTML and extract article content
3. Data Structuring: formats extracted data into a consistent schema
4. Storage: saves results to an Apify dataset for easy access

Key components:

- Smart content detection: identifies main article content using semantic HTML tags and heuristics
- Metadata extraction: pulls publication dates, authors, categories, and other structured data
- URL filtering: respects include/exclude patterns to focus on relevant content
- Concurrent processing: handles multiple websites simultaneously for efficiency

## Input Configuration

The scraper accepts a JSON input with the following structure:

```json
{
  "websites": [
    {
      "topic": "techcrunch",
      "urls": ["https://techcrunch.com/"],
      "patterns": ["**/2024/**", "**/article/**"],
      "ignoreUrls": [
        "https://techcrunch.com/author*",
        "https://techcrunch.com/category*",
        "https://techcrunch.com/tag*"
      ]
    },
    {
      "topic": "bbc-news",
      "urls": ["https://www.bbc.com/news"],
      "patterns": ["**/news/**"],
      "ignoreUrls": ["**/live/**", "**/weather/**"]
    },
    {
      "topic": "theverge",
      "urls": ["https://www.theverge.com/"],
      "patterns": [],
      "ignoreUrls": []
    }
  ],
  "maxRequestsPerCrawl": 100
}
```

### Configuration Fields

#### websites (required)

An array of website objects to scrape. Each website object contains:

- topic (string, required): a unique identifier for the website (used for labeling results)
- urls (array, required): starting URLs to begin crawling from
- patterns (array, optional): URL patterns to include (glob patterns supported)
  - Example: `["**/article/**", "**/news/**"]` - only scrape URLs containing "/article/" or "/news/"
  - Leave empty (`[]`) to include all discovered URLs
- ignoreUrls (array, optional): URL patterns to exclude (glob patterns supported)
  - Example: `["**/author/**", "**/category/**"]` - skip author pages and category pages
  - Useful for avoiding non-article pages such as navigation and archives

#### maxRequestsPerCrawl (number, optional)

Maximum number of requests per website (default: 100). Controls how many pages are scraped from each website to prevent unbounded crawling.
### Output

Scraped articles are stored in the Apify dataset. Each article contains:

#### Core Fields

- url - source URL where the article was scraped from
- loadedUrl - final loaded URL (may differ from the original due to redirects)
- baseUrl - base URL of the website
- articleText - main article content (minimum 300 characters required)
- title - article headline
- topic - website topic identifier from the input configuration

#### Metadata Fields

- publishDate - publication date as a Date object (parsed from publishDateString)
- publishDateString - raw publication date string as found on the page
- modifiedDate - last modified date as a Date object (if available)
- author - author name
- description - article description/summary
- canonicalUrl - canonical URL specified by the page

#### Content Classification

- type - content type (e.g., "article")
- section - article section/category
- tags - array of article tags
- keywords - article keywords

#### Media & SEO

- imageUrl - featured image URL
- imageAlt - alt text for the featured image
- robots - robots meta tag value

Note: empty fields are automatically removed from the output. Articles shorter than 300 characters are filtered out.
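The output rules above (300-character minimum, empty-field removal, publishDate parsed from publishDateString) can be reproduced when post-processing downloaded dataset items. The helper below is illustrative, not part of the actor; only the thresholds and field names come from the documentation.

```javascript
// Sketch: clean a downloaded dataset item the way the Output section
// describes. The 300-character minimum, empty-field removal, and
// publishDateString -> publishDate parsing mirror the documented behavior;
// this helper itself is an illustrative assumption.
function cleanArticle(item) {
  if (!item.articleText || item.articleText.length < 300) return null;
  const cleaned = {};
  for (const [key, value] of Object.entries(item)) {
    const isEmpty =
      value == null ||
      value === "" ||
      (Array.isArray(value) && value.length === 0);
    if (!isEmpty) cleaned[key] = value;
  }
  if (cleaned.publishDateString) {
    const parsed = new Date(cleaned.publishDateString);
    if (!Number.isNaN(parsed.getTime())) cleaned.publishDate = parsed;
  }
  return cleaned;
}

const article = cleanArticle({
  url: "https://example.com/post",
  title: "Example",
  articleText: "x".repeat(400),
  publishDateString: "2024-05-01T10:00:00Z",
  tags: [],   // empty array -> removed
  author: "", // empty string -> removed
});
// article.publishDate is a Date; article.tags and article.author are absent
```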
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try Universal Article Scraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: universal_scraping
- Pricing: Paid
- Total Runs: 220
- Active Users: 33
Related Actors
Smart Article Extractor
by lukaskrivka
Google Search
by devisty
Twitter Tweets Scraper
by gentle_cloud
Twitter Profile
by danek
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.