Sports News Scraper

by yearning_aspect

Delivers the latest news for the sports categories you choose and love.

5 runs
1 user
Try This Actor

Opens on Apify.com

About Sports News Scraper


What does this actor do?

Sports News Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
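Runs can also be triggered programmatically instead of through the Console. The sketch below builds and validates an input object locally using the category list and `maxArticlesPerSource` range from the actor's documentation; the actor ID and the commented `apify-client` call are assumptions based on Apify's standard client library, not something this page confirms.

```javascript
// Build and sanity-check an input object for the Sports News Scraper.
// Allowed category values and limits are taken from the actor's documentation.
const ALLOWED_CATEGORIES = [
  'cricket', 'football', 'kabaddi', 'ice-hockey', 'basketball', 'baseball',
];

function buildInput(categories, options = {}) {
  if (!Array.isArray(categories) || categories.length === 0) {
    throw new Error('categories must be a non-empty array');
  }
  for (const c of categories) {
    if (!ALLOWED_CATEGORIES.includes(c)) {
      throw new Error(`Unknown category: ${c}`);
    }
  }
  const max = options.maxArticlesPerSource ?? 20; // documented default
  if (max < 1 || max > 100) {
    throw new Error('maxArticlesPerSource must be between 1 and 100');
  }
  const input = { categories, maxArticlesPerSource: max };
  if (options.customWebsites) input.customWebsites = options.customWebsites;
  return input;
}

// Hypothetical usage with Apify's official client (actor ID assumed):
// const { ApifyClient } = require('apify-client');
// const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// const run = await client.actor('yearning_aspect/sports-news-scraper')
//   .call(buildInput(['cricket', 'football']));
// const { items } = await client.dataset(run.defaultDatasetId).listItems();
```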

Documentation

# Sports News Scraper

An Apify Actor that scrapes the latest sports news from multiple websites based on user-selected sport categories. Perfect for sports enthusiasts, journalists, and data analysts who need aggregated sports news from multiple sources.

## Overview

This actor collects sports news articles from various sports websites, processes them to remove duplicates, classifies transfer news, and outputs structured data ready for analysis or integration into your applications. It uses Cheerio for efficient HTML parsing and includes robust error handling to ensure reliable data collection even when individual sources fail.

## Features

- **Multi-Category Support**: Scrape news for Cricket, Football, Kabaddi, Ice Hockey, Basketball, and Baseball
- **Multiple Sources**: Aggregates news from various reputable sports websites
- **Transfer News Classification**: Automatically identifies and classifies transfer news as rumors or confirmed
- **Custom Sources**: Add your own websites with custom CSS selectors
- **Smart Deduplication**: Removes duplicate articles across sources based on title similarity
- **Error Resilience**: Continues scraping even if individual sources fail
- **Retry Logic**: Automatic retries with exponential backoff for network errors
- **Structured Output**: Clean, consistent JSON output saved to the Apify dataset
- **Comprehensive Logging**: Detailed logs for monitoring and debugging

## Input Configuration

### Required Parameters

- `categories` (array): One or more sport categories to scrape
  - Options: `cricket`, `football`, `kabaddi`, `ice-hockey`, `basketball`, `baseball`
  - Example: `["cricket", "football"]`

### Optional Parameters

- `customWebsites` (array): Add custom websites to scrape. Each website requires `name`, `url`, `category`, and `selectors`:

  ```json
  {
    "name": "Custom Sports Site",
    "url": "https://example.com/sports",
    "category": "cricket",
    "selectors": {
      "article": ".article-item",
      "title": ".title",
      "link": "a",
      "date": ".date",
      "description": ".summary"
    }
  }
  ```

- `useOnlyCustomWebsites` (boolean): If `true`, only scrape custom websites (default: `false`)
- `maxArticlesPerSource` (integer): Maximum articles per source (default: 20, range: 1-100)

## Output Format

Each scraped article includes:

```json
{
  "title": "Article title",
  "url": "https://example.com/article",
  "date": "2025-11-12T10:30:00Z",
  "description": "Article summary or description",
  "source": "Source website name",
  "category": "cricket",
  "tags": ["transfer", "news"],
  "transferInfo": {
    "isTransfer": true,
    "status": "confirmed",
    "confidence": 0.9
  },
  "scrapedAt": "2025-11-12T12:00:00Z"
}
```

## Usage Examples

### Basic Usage - Single Category

```json
{
  "categories": ["cricket"]
}
```

### Multiple Categories

```json
{
  "categories": ["cricket", "football", "kabaddi", "ice-hockey", "basketball", "baseball"]
}
```

### With Custom Website

```json
{
  "categories": ["cricket"],
  "customWebsites": [
    {
      "name": "My Cricket Site",
      "url": "https://mycricketsite.com/news",
      "category": "cricket",
      "selectors": {
        "article": ".news-item",
        "title": "h2",
        "link": "a",
        "date": ".publish-date"
      }
    }
  ]
}
```

### Custom Websites Only

```json
{
  "categories": ["football"],
  "useOnlyCustomWebsites": true,
  "customWebsites": [
    {
      "name": "My Football Source",
      "url": "https://myfootball.com/news",
      "category": "football",
      "selectors": {
        "article": ".article",
        "title": ".headline",
        "link": "a.read-more"
      }
    }
  ]
}
```

## Transfer News Classification

The actor automatically detects and classifies transfer-related news:

- **Rumor**: Articles containing keywords like "rumor", "speculation", "reported", "linked"
- **Confirmed**: Articles with keywords like "confirmed", "official", "announced", "signed"
- **Unknown**: Transfer news without clear classification

## Error Handling

- Network errors trigger automatic retries (up to 3 attempts with exponential backoff)
- Failed sources are logged but don't stop the entire scraping process
- Partial results are saved even if some sources fail

## Important Notes

### Anti-Scraping Protection

Many sports news websites implement anti-scraping measures that may block direct requests. When running on the Apify platform, the actor automatically benefits from Apify's infrastructure, which improves scraping success rates. For best results:

- **Use Custom Websites**: Add your own trusted sources with the `customWebsites` parameter
- **Apify Platform**: The actor works best when deployed on the Apify platform (better success rates than local testing)
- **Alternative Sources**: Some default sources may be blocked; use custom sources for reliable scraping

### Recommended Approach

For production use, we recommend:

1. Deploy the actor to the Apify platform
2. Test with your specific custom sources
3. Monitor success rates and adjust sources as needed
4. Use websites that are more scraper-friendly (smaller news sites, RSS feeds, etc.)

## Troubleshooting

### No Results Returned

**Problem**: Actor completes but returns no articles.

**Solutions**:

- Anti-scraping blocks: Many major sports sites block automated requests. Try using custom websites with less restrictive policies
- Use the Apify platform: The actor has better success rates on Apify than in local testing
- Verify that the selected categories have configured sources
- Check that custom website URLs are accessible and not blocked
- Review actor logs for specific error messages or network failures (403 and 429 errors indicate blocking)
- Ensure `useOnlyCustomWebsites` is not set to `true` without providing custom websites
- Test custom selectors on the target website to ensure they match elements

### Missing Data Fields

**Problem**: Some articles are missing `date`, `description`, or other fields.

**Explanation**: Not all sources provide all fields. The actor handles missing fields gracefully by setting them to `null`.
**Solutions**:

- This is expected behavior; filter results in your application if needed
- For custom websites, verify selectors are correctly targeting the desired elements
- Check whether the source website actually provides the missing information

### Parsing Errors

**Problem**: Actor logs show parsing errors for specific sources.

**Solutions**:

- Website structure changes may break selectors; this is common with web scraping
- For default sources, check if there's an updated version of the actor
- For custom sources, inspect the website HTML and update your selectors
- Use browser DevTools to test CSS selectors before adding them to the configuration

### Network Timeouts

**Problem**: Actor fails with timeout errors.

**Solutions**:

- Some websites may be slow or temporarily unavailable
- The actor automatically retries failed requests up to 3 times
- Consider increasing the actor's timeout setting in Apify Console
- Check if the website is blocking automated requests

### Memory Limit Exceeded

**Problem**: Actor fails with an out-of-memory error.

**Solutions**:

- Reduce `maxArticlesPerSource` to limit memory usage
- Scrape fewer categories in a single run
- Increase memory allocation in Apify Console (recommended: 512 MB or higher)

### Rate Limiting / Blocked Requests

**Problem**: Websites return 429 or 403 errors.
**Solutions**:

- Some websites may block automated requests
- The actor includes retry logic with exponential backoff
- Consider using Apify's proxy services for better success rates
- Reduce the number of concurrent requests if scraping many sources

## Performance Tips

- **Optimize Article Limits**: Set `maxArticlesPerSource` to a reasonable value (10-30) for faster runs
- **Select Specific Categories**: Only scrape categories you need to reduce runtime
- **Use Scheduling**: Schedule regular runs to keep data fresh without manual intervention
- **Monitor Success Rates**: Check logs to identify consistently failing sources

## Development

### Local Testing

```bash
# Install dependencies
npm install

# Run locally with the Apify CLI
apify run

# Or with Node.js directly
npm start

# Run tests
npm test
```

### Project Structure

```
apify-sports-news-scraper/
├── .actor/
│   └── actor.json        # Actor metadata and configuration
├── src/
│   ├── main.js           # Entry point
│   ├── config.js         # Source configuration
│   ├── scraper.js        # Scraping logic
│   ├── classifier.js     # Transfer classification
│   ├── processor.js      # Data processing
│   └── utils.js          # Utility functions
├── test/                 # Test files
├── INPUT_SCHEMA.json     # Input validation schema
├── Dockerfile            # Docker configuration
├── package.json          # Dependencies
└── README.md             # This file
```

## Technical Details

### Dependencies

- `apify`: Apify SDK for platform integration
- `cheerio`: Fast HTML parsing
- `axios`: HTTP client with retry support

### Retry Strategy

Network requests are automatically retried up to 3 times with exponential backoff:

- 1st retry: 1 second delay
- 2nd retry: 2 seconds delay
- 3rd retry: 4 seconds delay

### Deduplication Algorithm

Articles are deduplicated based on title similarity. Articles with very similar titles (from different sources) are merged, keeping the first occurrence and combining source information.
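The keep-first-and-merge deduplication described above can be sketched roughly as follows. The actor's actual similarity metric is not documented, so this sketch assumes a simple normalized-title comparison; treat it as an illustration of the approach, not the actor's real code.

```javascript
// Illustrative title-based deduplication: normalize titles, keep the first
// article seen for each normalized title, and merge source names into it.
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, '') // drop punctuation
    .replace(/\s+/g, ' ')
    .trim();
}

function deduplicateArticles(articles) {
  const seen = new Map(); // normalized title -> kept article
  for (const article of articles) {
    const key = normalizeTitle(article.title);
    const existing = seen.get(key);
    if (existing) {
      // Duplicate: record the extra source on the article kept first.
      if (!existing.sources.includes(article.source)) {
        existing.sources.push(article.source);
      }
    } else {
      seen.set(key, { ...article, sources: [article.source] });
    }
  }
  return [...seen.values()];
}
```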
## Limitations

- The actor scrapes publicly available news only
- Some websites may block automated scraping
- Selector-based scraping may break if websites change their HTML structure
- Transfer classification is keyword-based and may not be 100% accurate

## Support

For issues, feature requests, or questions:

- Check the troubleshooting section above
- Review actor logs in Apify Console for detailed error messages
- Submit feedback through the Apify platform
- Contact the actor maintainer for custom source configurations

## License

ISC
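The keyword-based transfer classification described in the documentation above can be sketched like this. The keyword lists come from the README; the confidence values and the tie-breaking rule (confirmed wording wins) are assumptions for illustration, not the actor's actual implementation.

```javascript
// Keyword-based transfer classification (illustrative). Keywords are from the
// actor's README; confidence values here are assumed, not the actor's own.
const RUMOR_KEYWORDS = ['rumor', 'speculation', 'reported', 'linked'];
const CONFIRMED_KEYWORDS = ['confirmed', 'official', 'announced', 'signed'];

function classifyTransfer(text) {
  const lower = text.toLowerCase();
  const isConfirmed = CONFIRMED_KEYWORDS.some((k) => lower.includes(k));
  const isRumor = RUMOR_KEYWORDS.some((k) => lower.includes(k));
  if (!isConfirmed && !isRumor) {
    return { isTransfer: false, status: null, confidence: 0 };
  }
  if (isConfirmed) {
    // When both kinds of keywords appear, prefer "confirmed" (assumption).
    return { isTransfer: true, status: 'confirmed', confidence: 0.9 };
  }
  return { isTransfer: true, status: 'rumor', confidence: 0.6 };
}
```

This mirrors the `transferInfo` shape shown in the Output Format section; a real implementation would likely weigh keyword counts rather than a single match.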

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Sports News Scraper now on Apify. Free tier available with no credit card required.

Start Free Trial

Actor Information

Developer
yearning_aspect
Pricing
Paid
Total Runs
5
Active Users
1
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
