In Depth News Scraper

by sync-network

Scrape complete news articles, not just headlines. This tool extracts full-length content from top sources for research, feeds, and data analysis.

287 runs

8 users

About In Depth News Scraper

Tired of news scrapers that only grab headlines? I've been there. The In Depth News Scraper is the actor I built my projects around because it actually pulls the full article text from major news sites. It goes beyond the snippet you see in search results, fetching the complete story so you get context, analysis, and the full narrative. You can configure it to deliver exactly what you need, whether that's a clean summary for a dashboard or the entire formatted article text for your database.

I use it primarily for two things: building curated news feeds on specific topics and feeding clean, structured data into analysis tools. Instead of manually visiting dozens of sites, this automates the collection of the latest updates from top sources. The key benefit is the depth; you're working with the real content, not just metadata. This makes it reliable for monitoring brand mentions, tracking industry trends, or compiling research datasets where headlines alone are useless.

Setting it up is straightforward. You pick a news category and optional keywords, and it handles the extraction, dealing with pagination and article layouts. The output is consistent JSON you can pipe directly into other apps or data warehouses. For anyone needing substantive news content at scale, this scraper eliminates the biggest pain point: getting past the lead paragraph.

What does this actor do?

In Depth News Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

In-Depth News Scraper

An Apify actor that extracts complete news articles, not just headlines, across major news categories and outlets. It returns structured, analysis-ready data.

Overview

This scraper fetches full article content from various news sources. You can filter by category, refine with keywords, and control the output detail. It's built for automation and data integration workflows where you need more than basic headlines.

Key Features

Full-Text Extraction: Gets the complete article body, not just summaries or metadata.
Category & Keyword Filtering: Target specific news categories (World, Business, Technology, etc.) and add keywords to narrow results.
Output Control: Choose between full articles or summaries via the contentLength parameter.
Structured Data: Returns consistent JSON with title, URL, date, source, content, and optional image URL.
Exclusion Filters: Use filterBadKeywords to block articles containing terms like "sponsored".
Time-Range Selection: Scrape current or historical articles by setting the timeRange parameter.

How to Use

Configure the actor with input parameters, then run it. The dataset will contain structured article objects.

Set your target newsCategory.
Optionally add additionalKeywords to refine the search within that category.
Configure other parameters like numberOfItems or contentLength.
Execute the actor.
Download or process the resulting dataset.
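
The steps above can also be driven from code. The following is a minimal sketch, assuming Apify's public REST API (its `run-sync-get-dataset-items` endpoint) and a placeholder actor ID, `sync-network~in-depth-news-scraper`, which is a guess; copy the real ID from the actor's Apify page. The helper names are illustrative and are not part of the actor.

```python
import json
import urllib.request

# Assumption: placeholder actor ID; copy the real one from the actor's Apify page.
ACTOR_ID = "sync-network~in-depth-news-scraper"
API_BASE = "https://api.apify.com/v2"


def build_run_request(token: str, run_input: dict,
                      actor_id: str = ACTOR_ID) -> urllib.request.Request:
    """Build (but do not send) a POST that runs the actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def fetch_articles(token: str, run_input: dict) -> list:
    """Send the request and decode the JSON array of article objects."""
    request = build_run_request(token, run_input)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)
```

Building the request separately from sending it makes the call easy to inspect before spending any platform credits. Synchronous endpoints are meant for short runs, so a large numberOfItems may be better served by starting the run asynchronously and reading the dataset afterwards.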

Input

Configure the actor using these parameters in a JSON object:

Parameter	Type	Description
`newsCategory`	String	Required. News category (e.g., "Technology", "World").
`additionalKeywords`	String	Optional. Keywords to refine search within the category.
`numberOfItems`	Number	Optional. Number of articles to retrieve (default: 10, max: 100).
`filterBadKeywords`	Array	Optional. Keywords to exclude from results (e.g., `["sponsored"]`).
`contentLength`	String	`"Full"` for complete article or `"Summary"` (default: `"Full"`).
`timeRange`	String	Time period for articles (e.g., "Past week").
`retrieveImage`	Boolean	Include `imageUrl` in output (default: `false`).

Example Configuration:

{
  "newsCategory": "Technology",
  "additionalKeywords": "artificial intelligence",
  "numberOfItems": 20,
  "filterBadKeywords": ["sponsored", "advertisement"],
  "contentLength": "Full",
  "timeRange": "Past week",
  "retrieveImage": false
}

Supported Categories: World, Business, Technology, Entertainment, Health, Science, Sports, Politics.
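
Checking the input locally before a run can catch typos early. The sketch below encodes only the constraints stated in the parameter table and the category list; the function name `build_input`, the lower bound of 1 for numberOfItems, and case-sensitive category matching are assumptions, not documented behavior.

```python
SUPPORTED_CATEGORIES = {
    "World", "Business", "Technology", "Entertainment",
    "Health", "Science", "Sports", "Politics",
}


def build_input(news_category, additional_keywords=None, number_of_items=10,
                filter_bad_keywords=None, content_length="Full",
                time_range=None, retrieve_image=False):
    """Return an input dict, raising ValueError for values the docs rule out."""
    # Assumption: category names are matched exactly as listed in the docs.
    if news_category not in SUPPORTED_CATEGORIES:
        raise ValueError(f"Unsupported newsCategory: {news_category!r}")
    # Assumption: at least one item; the documented maximum is 100.
    if not 1 <= number_of_items <= 100:
        raise ValueError("numberOfItems must be between 1 and 100")
    if content_length not in ("Full", "Summary"):
        raise ValueError('contentLength must be "Full" or "Summary"')

    run_input = {
        "newsCategory": news_category,
        "numberOfItems": number_of_items,
        "contentLength": content_length,
        "retrieveImage": retrieve_image,
    }
    # Optional fields are left out entirely when unset.
    if additional_keywords:
        run_input["additionalKeywords"] = additional_keywords
    if filter_bad_keywords:
        run_input["filterBadKeywords"] = list(filter_bad_keywords)
    if time_range:
        run_input["timeRange"] = time_range
    return run_input
```

Calling `build_input("Technology", additional_keywords="artificial intelligence", number_of_items=20, filter_bad_keywords=["sponsored", "advertisement"], time_range="Past week")` yields the same fields as the example configuration.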

Output

The actor outputs a dataset where each item is a structured JSON object representing one article.

{
  "title": "Article headline",
  "link": "Article URL",
  "pubDate": "2025-02-05T10:00:00.000Z",
  "source": "Publishing outlet name",
  "summary": "Brief article overview",
  "content": "Full article text (length depends on 'contentLength' setting)",
  "imageUrl": "Main image URL (if 'retrieveImage' is true)"
}
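
For downstream processing it can help to turn each item into a typed record. This is a sketch only: the `Article` class and `parse_article` function are illustrative names, and it assumes that `title`, `link`, `pubDate`, and `source` are always present while the other fields may be missing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Article:
    title: str
    link: str
    published: datetime
    source: str
    summary: str
    content: str
    image_url: Optional[str] = None


def parse_article(item: dict) -> Article:
    """Convert one dataset item into an Article with a timezone-aware UTC timestamp."""
    # The trailing "Z" is rewritten because datetime.fromisoformat
    # only accepts it from Python 3.11 onward.
    raw_date = item["pubDate"].replace("Z", "+00:00")
    published = datetime.fromisoformat(raw_date).astimezone(timezone.utc)
    return Article(
        title=item["title"],
        link=item["link"],
        published=published,
        source=item["source"],
        # Assumption: these fields may be absent, so fall back to empty values.
        summary=item.get("summary", ""),
        content=item.get("content", ""),
        image_url=item.get("imageUrl"),
    )
```

Parsing `pubDate` into a `datetime` makes it straightforward to sort or window articles by publication time.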

Performance & Notes

Speed: Full article extraction takes approximately 5-10 seconds per item.
Volume: Handles up to 100 articles per run. For faster runs, keep numberOfItems at 50 or below.
Reliability: Includes automatic retries for failed connections and dynamic delays to manage request rates.
Recommendations: Use specific keywords for relevant results. To improve speed, disable image retrieval ("retrieveImage": false) if you don't need images. Network conditions and source website performance can affect run time.
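
As a rough planning aid, the per-item figure above can be turned into a run-time range. This sketch assumes articles are fetched one after another with no fixed startup overhead, which the documentation does not confirm; `estimate_run_seconds` is an illustrative helper.

```python
def estimate_run_seconds(number_of_items, min_per_item=5.0, max_per_item=10.0):
    """Return (low, high) run-time bounds in seconds.

    Assumption: items are processed sequentially at 5-10 seconds each.
    """
    return number_of_items * min_per_item, number_of_items * max_per_item


low, high = estimate_run_seconds(100)
print(f"{low / 60:.1f} to {high / 60:.1f} minutes")  # 8.3 to 16.7 minutes
```

Under the same assumption, a 50-item run would take roughly half as long, which matches the advice to keep numberOfItems lower when speed matters.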