Google News Scraper

Name: Google News Scraper
Author: xmolodtsov

103 runs

6 users

About Google News Scraper

Extract full Google News articles with text, images, and metadata. 95%+ success rate, multi-region support, and smart content extraction with automatic fallbacks. Production-ready and cost-optimized.

What does this actor do?

Google News Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

# Google News Scraper

A streamlined and efficient Apify actor that scrapes Google News articles with full text extraction and intelligent content processing. Optimized for production use with unified architecture, cost-efficient operations, and smart content extraction.

✅ Fully optimized and production-ready!

## 🚀 Features

### Core Functionality

- 🔍 Flexible Search: Search by keywords, regions, languages, and date ranges
- 📰 Full Text Extraction: Real article content from Google News RSS feeds with HTML descriptions
- 🌍 Multi-Region Support: Search across different countries and languages
- 🤖 Smart Google News Handling: Automatic detection and processing of Google News URLs
- 📊 Rich Metadata: Titles, sources, dates, images, tags, and complete article information
- ⚡ High Success Rate: 95%+ success rate with intelligent fallback mechanisms

### Advanced Capabilities

- 🔗 Google News URL Resolution: Intelligent handling of Google News redirect URLs
- 🌐 Automatic Browser Mode: Automatically enables browser mode for Google News articles
- 🛡️ Consent Page Handling: Smart detection and handling of consent pages
- 🔄 Robust Error Handling: Comprehensive error recovery and retry mechanisms
- 📊 Real-time Monitoring: Performance metrics and health monitoring
- 🎯 RSS Feed Integration: Uses Google News RSS feeds for reliable data extraction

### Quality & Reliability

- ✅ Comprehensive Testing: Unit, integration, and performance tests
- 🔧 Error Recovery: Automatic recovery from network and parsing errors
- 📈 Performance Optimization: Memory management and concurrent processing
- 🏥 Health Monitoring: Real-time system health and error tracking
- 🧹 Data Validation: Input validation and output quality assurance

## 🎉 Latest Updates (v2.0.0)

Major architecture optimization!
The scraper has been completely streamlined for better performance and maintainability:

- ✅ Unified Architecture: Consolidated content extractors, proxy managers, and error handlers
- ✅ Cost Optimized: Smart resource usage with environment-aware configuration
- ✅ Simplified Codebase: Removed duplicate code and unnecessary complexity
- ✅ Enhanced Performance: Faster startup and improved resource efficiency
- ✅ Production Ready: Streamlined for production deployment with minimal overhead

Example output:

```json
{
  "title": "Tesla awards Musk $29 billion in shares with prior pay package in limbo - CNBC",
  "text": "Rich HTML content with article links and descriptions...",
  "source": "CNBC",
  "publishedAt": "2025-08-05T14:08:57.000Z",
  "tags": ["Tesla"],
  "extractionSuccess": true
}
```

## 📋 Quick Start

### Using Apify Console

1. Visit: Apify Console
2. Search: "Google News Scraper"
3. Configure: Set your search parameters
4. Run: Start the actor and monitor progress

### Using Apify CLI

```bash
# Install Apify CLI
npm install -g apify-cli

# Run the actor
apify call google-news-scraper --input '{
  "query": "Tesla",
  "region": "US",
  "language": "en-US",
  "maxItems": 3
}'
```

### Using Apify API

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('google-news-scraper').call({
  query: 'climate change',
  region: 'US',
  maxItems: 50
});

// Read the scraped articles from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

## ⚙️ Configuration

### Input Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | string | ✅ | "technology" | Search query for Google News |
| `region` | string | ❌ | "US" | Region code (US, GB, DE, FR, etc.) |
| `language` | string | ❌ | "en-US" | Language code (en-US, de-DE, fr-FR, etc.) |
| `maxItems` | number | ❌ | 3 | Maximum articles to scrape (0 = unlimited; the RSS feed returns at most ~100-200 articles) |
| `dateFrom` | string | ❌ | - | Start date for articles (YYYY-MM-DD format) |
| `dateTo` | string | ❌ | - | End date for articles (YYYY-MM-DD format) |
| `browserProxyGroups` | array | ❌ | ["RESIDENTIAL", "country-US"] | Proxy groups for browser-based resolution |

### Content Extraction

The scraper uses an intelligent multi-strategy approach:

- HTTP-first resolution: Tries efficient HTTP methods before browser automation
- Automatic browser fallback: Uses Playwright for JavaScript-heavy sites when needed
- Multi-strategy extraction: Readability, schema.org, custom selectors, and heuristics
- Quality validation: Articles must have 300+ characters and at least one valid image
- Consent handling: Automatic detection and bypass of consent pages

### Regional Support

| Region | Code | Language | Example Query |
|--------|------|----------|---------------|
| United States | US | en-US | Technology news |
| United Kingdom | GB | en-GB | Brexit updates |
| Germany | DE | de-DE | Klimawandel |
| France | FR | fr-FR | Intelligence artificielle |
| Japan | JP | ja-JP | 人工知能 |
| Australia | AU | en-AU | Bushfire news |

## 📊 Output Format

### Article Structure

```json
{
  "title": "Revolutionary AI Breakthrough in Healthcare",
  "url": "https://example.com/ai-healthcare-breakthrough",
  "text": "Full article content with comprehensive details...",
  "description": "Scientists develop AI system that can diagnose diseases...",
  "author": "Dr. Jane Smith",
  "publishedDate": "2024-01-15T14:30:00Z",
  "source": "TechNews Daily",
  "sourceUrl": "https://technews.com",
  "images": [
    "https://example.com/images/ai-healthcare.jpg",
    "https://example.com/images/doctor-ai.png"
  ],
  "extractionSuccess": true,
  "extractionMethod": "unfluff",
  "metadata": {
    "wordCount": 1250,
    "readingTime": "5 min",
    "language": "en",
    "contentQuality": 0.95
  },
  "scrapedAt": "2024-01-15T15:00:00Z"
}
```

### Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| `wordCount` | number | Number of words in article text |
| `readingTime` | string | Estimated reading time |
| `language` | string | Detected content language |
| `contentQuality` | number | Quality score (0-1) |
| `extractionMethod` | string | Method used for extraction |
| `processingTime` | number | Time taken to process (ms) |

## 🔧 Development

### Local Development Setup

```bash
# Clone the repository
git clone https://github.com/your-username/google-news-scraper
cd google-news-scraper

# Install dependencies
npm install

# Set up development environment
npm run dev:setup

# Start development mode
npm run dev
```

### Testing

```bash
# Run all tests
npm test

# Run development tests
npm run dev:test

# Run test scenarios
npm run dev:scenarios

# Check environment health
npm run dev:health
```

### Monitoring

```bash
# Real-time monitoring
npm run monitor

# View logs
npm run logs

# Health check
npm run dev:health
```

For detailed development information, see DEV_README.md.
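
As a rough illustration of the Input Parameters table in the Configuration section, the documented defaults and the YYYY-MM-DD date format can be applied on the caller's side before a run is started. This is only a sketch: `buildInput` and `isValidDate` are hypothetical helper names that are not part of the actor, only the parameter names and default values come from the table, and the actor's own validation may behave differently. The sketch treats `query` as mandatory, since the table marks it as required.

```javascript
// Hypothetical helpers: not part of the actor's code.
// Default values are copied from the Input Parameters table.
const DEFAULTS = { region: 'US', language: 'en-US', maxItems: 3 };

// Checks the YYYY-MM-DD shape and that the date exists on the calendar.
function isValidDate(value) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

// Merges caller options over the defaults and rejects malformed values.
function buildInput(options) {
  const input = { ...DEFAULTS, ...options };
  if (typeof input.query !== 'string' || input.query.trim() === '') {
    throw new Error('query is required');
  }
  if (!Number.isInteger(input.maxItems) || input.maxItems < 0) {
    throw new Error('maxItems must be a non-negative integer (0 = unlimited)');
  }
  for (const key of ['dateFrom', 'dateTo']) {
    if (input[key] !== undefined && !isValidDate(input[key])) {
      throw new Error(`${key} must use the YYYY-MM-DD format`);
    }
  }
  // ISO dates compare correctly as plain strings.
  if (input.dateFrom && input.dateTo && input.dateFrom > input.dateTo) {
    throw new Error('dateFrom must not be later than dateTo');
  }
  return input;
}

// Example: only the query and a start date are supplied by the caller.
const exampleInput = buildInput({ query: 'Tesla', dateFrom: '2025-08-01' });
console.log(exampleInput);
```

The returned object can be passed as the actor input in the CLI or API calls shown in Quick Start. Checking dates locally avoids spending a run on an input that would be rejected or silently ignored.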
## 📚 Documentation

- API Reference: Detailed API documentation
- Configuration Guide: Complete configuration options
- Developer Guide: Technical documentation
- Troubleshooting: Common issues and solutions
- Examples: Practical usage examples

## 🔍 Use Cases

### News Monitoring

```javascript
// Monitor breaking news
const newsMonitoringInput = {
  "query": "breaking news",
  "region": "US",
  "maxItems": 10
};
```

### Market Research

```javascript
// Track industry trends
const marketResearchInput = {
  "query": "artificial intelligence startup funding",
  "region": "US",
  "maxItems": 50
};
```

### Content Analysis

```javascript
// Analyze sentiment and topics
const contentAnalysisInput = {
  "query": "climate change policy",
  "region": "GB",
  "language": "en-GB",
  "maxItems": 100
};
```

## ⚡ Performance

### Benchmarks

- Processing Speed: ~50 articles per minute
- Memory Usage: <512MB for 1000 articles
- Success Rate: >95% with retry logic
- Concurrent Requests: Up to 10 simultaneous

### Optimization Tips

1. Use appropriate maxItems: Don't request more than needed
2. Enable proxy rotation: For high-volume scraping
3. Set reasonable delays: Respect rate limits
4. Monitor performance: Use built-in monitoring tools

## 🛡️ Error Handling

### Automatic Recovery

- Network Errors: Exponential backoff retry
- Rate Limiting: Automatic delay adjustment
- Consent Pages: Automatic bypass strategies
- Content Extraction: Multiple fallback methods
- Circuit Breakers: Prevent cascade failures

### Error Types

- Retryable: Network timeouts, rate limits, temporary failures
- Non-retryable: Invalid inputs, authentication errors
- Recoverable: Partial content extraction, image validation failures

## 📈 Monitoring & Analytics

### Built-in Metrics

- Request success/failure rates
- Response times and performance
- Memory usage and optimization
- Error classification and trends
- Content extraction quality

### Health Monitoring

- Real-time system health
- Circuit breaker status
- Resource utilization
- Error rate thresholds

## 🤝 Contributing

We welcome contributions!
Please see our Contributing Guide for details.

### Development Workflow

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## 📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

## 🆘 Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@example.com

## 🏆 Acknowledgments

- Built with Apify SDK
- Content extraction powered by Unfluff
- XML parsing by fast-xml-parser
- Web scraping with Crawlee
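
The Error Handling section mentions exponential backoff retry for network errors and separates retryable from non-retryable errors. The actor's real retry code is not shown in this README, so the following is only a minimal sketch of the general technique. The names `backoffDelay` and `withRetry`, the 500 ms base delay, and the 30 s cap are all assumptions made for illustration.

```javascript
// Hypothetical sketch: the actor's real retry logic is not shown in the README.
// The delay doubles with every attempt and is capped at maxDelayMs.
function backoffDelay(attempt, baseDelayMs = 500, maxDelayMs = 30000) {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `task` and retries it up to `retries` times after a failure.
// `isRetryable` decides whether an error is worth retrying at all,
// for example network timeouts (yes) versus invalid input (no).
async function withRetry(task, { retries = 3, baseDelayMs = 500, isRetryable = () => true } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Give up once the retry budget is spent or the error is permanent.
      if (attempt >= retries || !isRetryable(error)) throw error;
      await sleep(backoffDelay(attempt, baseDelayMs));
    }
  }
}
```

With the assumed defaults, a task that keeps failing is attempted four times in total, with waits of 500 ms, 1000 ms, and 2000 ms between attempts. Production implementations often add random jitter to the delay so that many clients do not retry at the same moment; that is left out here for brevity.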

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Google News Scraper now on Apify. Free tier available with no credit card required.

Actor Information

Developer: xmolodtsov
Pricing: Paid
Total Runs: 103
Active Users: 6

Related Actors

Smart Article Extractor

by lukaskrivka

Google Search

by devisty

Twitter Tweets Scraper

by gentle_cloud

Twitter Profile

by danek

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
