Google News Scraper
by xmolodtsov
About Google News Scraper
Extract full Google News articles with text, images & metadata. 95%+ success rate, multi-region support, smart content extraction with automatic fallbacks. Production-ready & cost-optimized
What does this actor do?
Google News Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
# Google News Scraper

A streamlined and efficient Apify actor that scrapes Google News articles with full text extraction and intelligent content processing. Optimized for production use with unified architecture, cost-efficient operations, and smart content extraction.

Fully optimized and production-ready!

## Features

### Core Functionality

- Flexible Search: Search by keywords, regions, languages, and date ranges
- Full Text Extraction: Real article content from Google News RSS feeds with HTML descriptions
- Multi-Region Support: Search across different countries and languages
- Smart Google News Handling: Automatic detection and processing of Google News URLs
- Rich Metadata: Titles, sources, dates, images, tags, and complete article information
- High Success Rate: 95%+ success rate with intelligent fallback mechanisms

### Advanced Capabilities

- Google News URL Resolution: Intelligent handling of Google News redirect URLs
- Automatic Browser Mode: Automatically enables browser mode for Google News articles
- Consent Page Handling: Smart detection and handling of consent pages
- Robust Error Handling: Comprehensive error recovery and retry mechanisms
- Real-time Monitoring: Performance metrics and health monitoring
- RSS Feed Integration: Uses Google News RSS feeds for reliable data extraction

### Quality & Reliability

- Comprehensive Testing: Unit, integration, and performance tests
- Error Recovery: Automatic recovery from network and parsing errors
- Performance Optimization: Memory management and concurrent processing
- Health Monitoring: Real-time system health and error tracking
- Data Validation: Input validation and output quality assurance

## Latest Updates (v2.0.0)

Major architecture optimization! The scraper has been completely streamlined for better performance and maintainability:

- Unified Architecture: Consolidated content extractors, proxy managers, and error handlers
- Cost Optimized: Smart resource usage with environment-aware configuration
- Simplified Codebase: Removed duplicate code and unnecessary complexity
- Enhanced Performance: Faster startup and improved resource efficiency
- Production Ready: Streamlined for production deployment with minimal overhead

Example output:

```json
{
  "title": "Tesla awards Musk $29 billion in shares with prior pay package in limbo - CNBC",
  "text": "Rich HTML content with article links and descriptions...",
  "source": "CNBC",
  "publishedAt": "2025-08-05T14:08:57.000Z",
  "tags": ["Tesla"],
  "extractionSuccess": true
}
```

## Quick Start

### Using Apify Console

1. Visit: Apify Console
2. Search: "Google News Scraper"
3. Configure: Set your search parameters
4. Run: Start the actor and monitor progress

### Using Apify CLI

```bash
# Install Apify CLI
npm install -g apify-cli

# Run the actor
apify call google-news-scraper --input '{
  "query": "Tesla",
  "region": "US",
  "language": "en-US",
  "maxItems": 3
}'
```

### Using Apify API

```javascript
// The apify-client package exports ApifyClient
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('google-news-scraper').call({
  query: 'climate change',
  region: 'US',
  maxItems: 50
});

// Read the scraped articles from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

## Configuration

### Input Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| query | string | Yes | "technology" | Search query for Google News |
| region | string | No | "US" | Region code (US, GB, DE, FR, etc.) |
| language | string | No | "en-US" | Language code (en-US, de-DE, fr-FR, etc.) |
| maxItems | number | No | 3 | Maximum articles to scrape (0 = unlimited, ~100-200 max from RSS) |
| dateFrom | string | No | - | Start date for articles (YYYY-MM-DD format) |
| dateTo | string | No | - | End date for articles (YYYY-MM-DD format) |
| browserProxyGroups | array | No | ["RESIDENTIAL", "country-US"] | Proxy groups for browser-based resolution |

### Content Extraction

The scraper uses an intelligent multi-strategy approach:

- HTTP-first resolution: Tries efficient HTTP methods before browser automation
- Automatic browser fallback: Uses Playwright for JavaScript-heavy sites when needed
- Multi-strategy extraction: Readability, schema.org, custom selectors, and heuristics
- Quality validation: Articles must have 300+ characters and at least one valid image
- Consent handling: Automatic detection and bypass of consent pages

### Regional Support

| Region | Code | Language | Example Query |
|--------|------|----------|---------------|
| United States | US | en-US | Technology news |
| United Kingdom | GB | en-GB | Brexit updates |
| Germany | DE | de-DE | Klimawandel |
| France | FR | fr-FR | Intelligence artificielle |
| Japan | JP | ja-JP | 人工知能 |
| Australia | AU | en-AU | Bushfire news |

## Output Format

### Article Structure

```json
{
  "title": "Revolutionary AI Breakthrough in Healthcare",
  "url": "https://example.com/ai-healthcare-breakthrough",
  "text": "Full article content with comprehensive details...",
  "description": "Scientists develop AI system that can diagnose diseases...",
  "author": "Dr. Jane Smith",
  "publishedDate": "2024-01-15T14:30:00Z",
  "source": "TechNews Daily",
  "sourceUrl": "https://technews.com",
  "images": [
    "https://example.com/images/ai-healthcare.jpg",
    "https://example.com/images/doctor-ai.png"
  ],
  "extractionSuccess": true,
  "extractionMethod": "unfluff",
  "metadata": {
    "wordCount": 1250,
    "readingTime": "5 min",
    "language": "en",
    "contentQuality": 0.95
  },
  "scrapedAt": "2024-01-15T15:00:00Z"
}
```

### Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| wordCount | number | Number of words in article text |
| readingTime | string | Estimated reading time |
| language | string | Detected content language |
| contentQuality | number | Quality score (0-1) |
| extractionMethod | string | Method used for extraction |
| processingTime | number | Time taken to process (ms) |

## Development

### Local Development Setup

```bash
# Clone the repository
git clone https://github.com/your-username/google-news-scraper
cd google-news-scraper

# Install dependencies
npm install

# Set up development environment
npm run dev:setup

# Start development mode
npm run dev
```

### Testing

```bash
# Run all tests
npm test

# Run development tests
npm run dev:test

# Run test scenarios
npm run dev:scenarios

# Check environment health
npm run dev:health
```

### Monitoring

```bash
# Real-time monitoring
npm run monitor

# View logs
npm run logs

# Health check
npm run dev:health
```

For detailed development information, see DEV_README.md.
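Once a run has finished, the fields described under Output Format can be used to screen dataset items on the client side. The sketch below mirrors the documented quality rule (300+ characters of text and at least one valid image) and additionally checks `extractionSuccess` and `metadata.contentQuality`. The helper names `isHttpUrl` and `filterUsableArticles` and the 0.5 quality threshold are assumptions made for illustration; they are not part of the actor.

```javascript
// Hypothetical client-side helpers; not part of the actor itself.
// Field names follow the Article Structure example above.

// Returns true only for well-formed http(s) URLs.
function isHttpUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // new URL() throws on malformed or missing input
  }
}

// Keeps items that were extracted successfully, have enough text,
// include at least one valid image URL, and are not explicitly low quality.
// minQuality = 0.5 is an assumed threshold, not a documented default.
function filterUsableArticles(items, { minChars = 300, minQuality = 0.5 } = {}) {
  return items.filter((item) => {
    if (item.extractionSuccess !== true) return false;
    if (typeof item.text !== 'string' || item.text.length < minChars) return false;
    const images = Array.isArray(item.images) ? item.images : [];
    if (!images.some(isHttpUrl)) return false;
    const quality = item.metadata?.contentQuality;
    // Items without a score are kept; only an explicit low score is rejected.
    return typeof quality !== 'number' || quality >= minQuality;
  });
}
```

Note that an item shaped like the shorter example output (no `images` field) would be dropped by this filter, so the image check may need to be relaxed for results that only carry RSS content.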
## Documentation

- API Reference: Detailed API documentation
- Configuration Guide: Complete configuration options
- Developer Guide: Technical documentation
- Troubleshooting: Common issues and solutions
- Examples: Practical usage examples

## Use Cases

### News Monitoring

```javascript
// Monitor breaking news
{
  "query": "breaking news",
  "region": "US",
  "maxItems": 10
}
```

### Market Research

```javascript
// Track industry trends
{
  "query": "artificial intelligence startup funding",
  "region": "US",
  "maxItems": 50
}
```

### Content Analysis

```javascript
// Analyze sentiment and topics
{
  "query": "climate change policy",
  "region": "GB",
  "language": "en-GB",
  "maxItems": 100
}
```

## Performance

### Benchmarks

- Processing Speed: ~50 articles per minute
- Memory Usage: <512MB for 1000 articles
- Success Rate: >95% with retry logic
- Concurrent Requests: Up to 10 simultaneous

### Optimization Tips

1. Use appropriate maxItems: Don't request more than needed
2. Enable proxy rotation: For high-volume scraping
3. Set reasonable delays: Respect rate limits
4. Monitor performance: Use built-in monitoring tools

## Error Handling

### Automatic Recovery

- Network Errors: Exponential backoff retry
- Rate Limiting: Automatic delay adjustment
- Consent Pages: Automatic bypass strategies
- Content Extraction: Multiple fallback methods
- Circuit Breakers: Prevent cascade failures

### Error Types

- Retryable: Network timeouts, rate limits, temporary failures
- Non-retryable: Invalid inputs, authentication errors
- Recoverable: Partial content extraction, image validation failures

## Monitoring & Analytics

### Built-in Metrics

- Request success/failure rates
- Response times and performance
- Memory usage and optimization
- Error classification and trends
- Content extraction quality

### Health Monitoring

- Real-time system health
- Circuit breaker status
- Resource utilization
- Error rate thresholds

## Contributing

We welcome contributions! Please see our Contributing Guide for details.

### Development Workflow

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@example.com

## Acknowledgments

- Built with Apify SDK
- Content extraction powered by Unfluff
- XML parsing by fast-xml-parser
- Web scraping with Crawlee
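The Error Handling section mentions exponential backoff for network errors and separates retryable from non-retryable failures. The actor's internal implementation is not shown in this README, so the following is only a minimal sketch of that general pattern. The function names, the list of error codes, and the default delays are all assumptions made for illustration.

```javascript
// A sketch of exponential backoff with a retryable/non-retryable split.
// Everything here is hypothetical and not taken from the actor's source.

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Assumed classification: network timeouts, HTTP 429, and 5xx are retried;
// anything else (invalid input, authentication errors) fails immediately.
function isRetryable(error) {
  const retryableCodes = new Set(['ETIMEDOUT', 'ECONNRESET', 'EAI_AGAIN']);
  return retryableCodes.has(error.code)
    || error.statusCode === 429
    || error.statusCode >= 500;
}

// Runs `task` until it succeeds, the error is non-retryable,
// or `retries` retries have been used up.
async function retryWithBackoff(task, {
  retries = 4,
  baseMs = 500,
  maxMs = 30000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      if (attempt >= retries || !isRetryable(error)) throw error;
      await sleep(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

With the defaults above, the waits between attempts are 500 ms, 1 s, 2 s, and 4 s. A production version would typically add random jitter to the delay so that many clients do not retry in lockstep.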
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Actor Information
- Developer: xmolodtsov
- Pricing: Paid
- Total Runs: 103
- Active Users: 6
Related Actors
- Smart Article Extractor by lukaskrivka
- Google Search by devisty
- Twitter Tweets Scraper by gentle_cloud
- Twitter Profile by danek
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.