CNN Article Scraper

Name: CNN Article Scraper
Author: filip_cicvarek

157 runs

13 users

About CNN Article Scraper

Extract CNN articles by category or search query with date filtering. Scrape news from politics, business, world, tech, sports, and more. Get structured data: title, author, publication date, full content. Perfect for media monitoring, research, and content analysis.

What does this actor do?

CNN Article Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
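
The "configure the input parameters" step can be sketched in Python as below. The field names (`category`, `searchQuery`, `startDate`, `endDate`, `maxArticles`, `concurrency`, `categoryMode`, `archiveMonthLimit`) come from the Actor's README; the helper names `months_spanned` and `build_input` are hypothetical, and deriving `archiveMonthLimit` from the date window is only one reading of the README's advice to cover every month in your range.

```python
from datetime import date
from typing import Optional

VALID_MODES = {"latest", "archive", "auto"}  # categoryMode values listed in the README


def months_spanned(start: date, end: date) -> int:
    """Count the calendar months touched by the window [start, end], inclusive."""
    if end < start:
        raise ValueError("endDate must not be earlier than startDate")
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def build_input(
    start: date,
    end: date,
    category: Optional[str] = None,
    search_query: Optional[str] = None,
    max_articles: int = 100,
    concurrency: int = 5,
    category_mode: str = "auto",
) -> dict:
    """Assemble a run input dict (hypothetical helper, not part of the Actor)."""
    if not category and not search_query:
        raise ValueError("provide at least one of category or searchQuery")
    if category_mode not in VALID_MODES:
        raise ValueError(f"categoryMode must be one of {sorted(VALID_MODES)}")
    if max_articles < 0:
        raise ValueError("maxArticles must be 0 (no limit) or a positive number")

    run_input = {
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "maxArticles": max_articles,
        "concurrency": concurrency,
        "categoryMode": category_mode,
        # Assumption: one sitemap month per calendar month in the window.
        "archiveMonthLimit": months_spanned(start, end),
    }
    if category:
        run_input["category"] = category
    if search_query:
        run_input["searchQuery"] = search_query
    return run_input


if __name__ == "__main__":
    print(build_input(date(2025, 3, 1), date(2025, 10, 10), category="world"))
```

The resulting dict can be pasted into the Actor's input editor as JSON or passed as the run input when starting the Actor through the Apify API.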

Documentation

# CNN Article Scraper

Extract articles from CNN by category or search query with precise date filtering. This Actor scrapes article metadata and full content from CNN's website, making it ideal for media monitoring, content research, and data analysis.

## What does CNN Article Scraper do?

This Actor retrieves articles from CNN.com based on your specified criteria:

- Category-based scraping: Extract articles from specific CNN sections (politics, business, world news, etc.)
- Search-based scraping: Find articles matching specific keywords or topics
- Date filtering: Precisely control the publication time window
- Concurrent processing: Adjust scraping speed with configurable concurrency
- Structured output: Get clean, organized article data including title, author, publication date, full content, and URL

## Use Cases

- Media Monitoring: Track CNN's coverage of specific topics or events over time and identify trends in news reporting
- Market Research: Analyze business and technology news for competitive intelligence, industry trends, and market insights
- Academic Research: Collect news articles for content analysis, sentiment studies, or media studies research projects
- Content Aggregation: Build news feeds or newsletters by automatically collecting relevant CNN articles within specific timeframes
- Competitive Analysis: Track how CNN covers your industry, competitors, or specific topics compared to other news sources

## Input Highlights

- `category` / `searchQuery`: Provide at least one. Supplying both returns only overlapping articles.
- `categoryMode`: `'latest'` (fast landing page scan), `'archive'` (monthly sitemap crawl for deep history), or `'auto'` (default heuristic based on your date window).
- `archiveMonthLimit`: Caps how many monthly sitemap files are loaded when archive mode runs. Increase for longer ranges, but expect slower runs.
- `maxArticles`: Set to `0` for no limit; the Actor keeps going until it exhausts collected links.
- `concurrency`: Controls how many article detail pages are fetched in parallel.
- Oldest-first processing: Sitemap discoveries are sorted chronologically, guaranteeing the Actor starts with the oldest articles inside your window.

### Example Input

```json
{
  "category": "world",
  "startDate": "2025-03-01",
  "endDate": "2025-10-10",
  "maxArticles": 100,
  "concurrency": 5,
  "categoryMode": "auto",
  "archiveMonthLimit": 12
}
```

## Output Format

Each scraped article is stored as a separate item in the dataset with the following structure:

```json
{
  "title": "Article headline",
  "author": "Reporter Name",
  "publicationDate": "2025-01-15",
  "updatedDate": "2025-01-20",
  "content": "Full article text content...",
  "url": "https://www.cnn.com/2025/01/15/politics/article-slug/index.html",
  "scrapedAt": "2025-10-10T14:30:00.000Z"
}
```

### Output Fields

- `title`: Article headline as it appears on CNN
- `author`: Article author(s) name or "Unknown" if not found
- `publicationDate`: Publication date in YYYY-MM-DD format
- `updatedDate`: Last updated date in YYYY-MM-DD format when available
- `content`: Full article text with paragraphs separated by double line breaks
- `url`: Direct link to the article on CNN.com
- `scrapedAt`: ISO timestamp of when the article was scraped

## Features

- ✅ Dual scraping modes: Category browsing or keyword search
- ✅ Archive-aware category discovery: Navigates CNN’s live article sitemaps (with RSS fallback when sitemaps are unavailable)
- ✅ Precise date filtering: Only scrapes articles within your specified date range
- ✅ Early filtering optimization: Filters articles by date before scraping full content
- ✅ Automatic retry logic: Handles temporary network errors with built-in retry mechanism
- ✅ Concurrent processing: Adjustable parallelization for faster scraping
- ✅ Clean content extraction: Filters out ads, JavaScript code, and non-article content
- ✅ Structured data output: Consistent JSON format for easy integration
- ✅ Duplicate prevention: Automatically removes duplicate article URLs
- ✅ Pay-per-use pricing: Only pay for what you scrape
- ✅ Chronological batching: Prioritises the oldest articles inside your date window so you see early coverage first

## Performance & Limits

### Speed Optimization

- Concurrency: Higher concurrency speeds up scraping but uses more resources
- Date filtering: Early date filtering reduces unnecessary requests
- Batch processing: Articles are processed in batches based on concurrency setting
- Archive mode: Sitemap downloads add latency; when CNN blocks sitemap access the Actor falls back to RSS feeds (coverage may be narrower), so reduce the date range or `archiveMonthLimit` when you only need recent content. Sitemaps supply hundreds of URLs per month, so consider lowering `maxArticles` if you only need a subset.

### Recommended Settings

- For quick tests: `maxArticles: 10`, `concurrency: 1`
- For moderate scraping: `maxArticles: 100`, `concurrency: 5`
- For large-scale scraping: `maxArticles: 0` (unlimited), `concurrency: 10-15`
- For historical digging: `categoryMode: "archive"`, widen `archiveMonthLimit` to cover every month in your range, and be prepared for longer runtimes

## Troubleshooting

### No articles found

Problem: Actor completes but returns zero articles.

Solutions:

- For older timeframes, switch `categoryMode` to `"archive"` (or increase `archiveMonthLimit`) so the Actor scans the CNN sitemap.
- Verify your date range includes actual published articles; try widening the window temporarily.
- Check if the category URL structure has changed
- Try using `searchQuery` instead of `category` for more reliable results

### Missing author or content

Problem: Some fields return "Unknown" or empty content.

Solutions:

- CNN's HTML structure varies by article type. Some articles (videos, opinion pieces) may have different layouts
- The Actor uses multiple selectors to extract data but cannot guarantee 100% success for all article types
- Consider filtering results by checking for non-empty fields in your post-processing

### Scraping too slow

Problem: Actor takes too long to complete.

Solutions:

- Increase `concurrency` to 10-15 for faster parallel processing
- Reduce `maxArticles` if you don't need all available articles
- Narrow your date range to reduce the number of articles to process

## Limitations

- The Actor scrapes publicly available CNN articles only
- Article structure may vary, affecting data extraction accuracy
- Very old articles may have different HTML structures
- Category archive filtering uses URL keywords; niche sub-sections may require a search query for full coverage
- Sitemap responses can list hundreds of URLs for a single month; the Actor trims to the oldest `archiveMonthLimit` months to control runtime
- CNN occasionally throttles or withholds sitemap data; in those cases the RSS fallback only exposes the stories the feeds provide
- CNN may update their website structure, requiring Actor maintenance
- Search API results are limited to what CNN makes available through their search service

## Support

Need help or have questions about this Actor?

- Open an issue in the Actor's Issues tab
- Check the Apify documentation for general platform guidance
- Review this README for configuration and troubleshooting tips

## Feedback

If you found this Actor helpful, please leave a review on the Actor page. Your feedback helps improve the Actor and helps other users discover it.

---

Pricing: This Actor uses pay-per-use pricing. You only pay for the compute resources consumed during scraping. See the Apify pricing page for current rates.

Actor Information

Developer: filip_cicvarek
Pricing: Paid
Total Runs: 157
Active Users: 13
