article-scrapper

by credible_sandal


About article-scrapper

A flexible and powerful Apify Actor for scraping articles from tech news websites. This scraper can work with any tech news site - either from predefined presets or custom URLs provided by you.

What does this actor do?

article-scrapper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
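If you prefer to do steps 3 and 4 from your own code, the sketch below shows one way to start a run and download results with the official apify-client Python package. The token is a placeholder and the actor ID is assumed from this listing's developer and actor names; adjust both to your account.

```python
# Minimal sketch: run the actor via the Apify API and fetch the results.
# "<YOUR_APIFY_TOKEN>" is a placeholder; the actor ID is assumed from the listing.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "usePresets": True,
    "presetSources": ["verge", "techcrunch"],
    "maxArticlesPerSource": 10,
    "includeContent": False,
}

# Start the actor and wait for the run to finish
run = client.actor("credible_sandal/article-scrapper").call(run_input=run_input)

# Iterate over the scraped articles in the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "-", item["url"])
```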

Documentation

# Tech News Article Scraper

A flexible and powerful Apify Actor for scraping articles from tech news websites. This scraper can work with any tech news site - either from predefined presets or custom URLs provided by you.

## Features

- Universal Scraping: Generic scraper works with most tech news sites
- Preset Sites: Pre-configured settings for popular tech news sources
- Custom URLs: Scrape any tech news site by providing URLs
- Smart Extraction: Automatically detects article content, titles, authors, dates, images, and tags
- Advanced Filtering: Filter by keywords, title text, or description content
- Flexible Configuration: Control article count, pagination, and content inclusion
- Error Handling: Robust retry logic and graceful error handling
- Rate Limiting: Built-in delays to respect website resources

## Supported Preset Sites

The following sites are pre-configured for easy scraping:

- The Verge - Tech news and media
- CNET - Tech product reviews and news
- Wired - Technology and culture
- TechCrunch - Startup and technology news
- Ars Technica - Technology news and analysis
- Engadget - Consumer electronics and gadgets
- The Guardian Tech - Technology news from The Guardian
- The Next Web - International technology news

## How It Works

This actor uses a generic scraper that adapts to different news site structures:

1. Detects Article Links: Uses multiple strategies to find article URLs on listing pages
   - Looks for `<article>` tags
   - Searches headings (h1, h2, h3) for links
   - Finds elements with article/post/story classes
   - Identifies URLs with date patterns (/2024/, /2025/)
2. Extracts Content: Uses fallback strategies for each field
   - Title: h1 tag → og:title → twitter:title → title tag
   - Author: rel="author" → author classes → itemprop="author" → meta tags
   - Date: time[datetime] → datePublished → meta tags
   - Content: article tag → articleBody → content classes
   - Summary: og:description → meta description → intro paragraph
   - Image: og:image → twitter:image → first article image
   - Tags: rel="tag" → category links → tag classes
3. Handles Edge Cases: Normalizes URLs, filters non-articles, removes duplicates
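To make the fallback idea in step 2 concrete, here is a minimal illustrative sketch of the title chain. It is not the actor's actual code; BeautifulSoup is used here only as an example of how such a chain could look when parsing static HTML.

```python
# Illustrative only - not the actor's implementation.
from bs4 import BeautifulSoup

def extract_title(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")

    # 1. First <h1> on the page
    h1 = soup.find("h1")
    if h1 and h1.get_text(strip=True):
        return h1.get_text(strip=True)

    # 2. Open Graph title, then Twitter card title
    for value, attr in (("og:title", "property"), ("twitter:title", "name")):
        meta = soup.find("meta", attrs={attr: value})
        if meta and meta.get("content"):
            return meta["content"].strip()

    # 3. Fall back to the <title> tag
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return None
```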
## Usage Examples

### Example 1: Scrape Latest Articles from The Verge

```json
{
  "usePresets": true,
  "presetSources": ["verge"],
  "maxArticlesPerSource": 20,
  "maxPages": 1,
  "includeContent": true
}
```

### Example 2: Scrape Multiple Tech Sites (Light Mode)

Get summaries without full content to save time:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired", "arstechnica"],
  "maxArticlesPerSource": 10,
  "maxPages": 1,
  "includeContent": false
}
```

### Example 3: Scrape Custom Tech Blogs

```json
{
  "usePresets": false,
  "customUrls": [
    "https://www.theverge.com",
    "https://news.ycombinator.com",
    "https://9to5mac.com"
  ],
  "maxArticlesPerSource": 15,
  "maxPages": 2,
  "includeContent": true
}
```

### Example 4: Deep Scrape a Single Site

Get many articles from one source:

```json
{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 50,
  "maxPages": 5,
  "includeContent": true
}
```

### Example 5: Search for Specific Topics (AI Articles)

Filter articles by keywords in title or description:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired"],
  "maxArticlesPerSource": 20,
  "searchKeywords": ["AI", "artificial intelligence", "ChatGPT", "GPT-4"]
}
```

### Example 6: Filter by Title

Only scrape articles with "iPhone" in the title:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "cnet"],
  "maxArticlesPerSource": 15,
  "titleContains": "iPhone"
}
```

### Example 7: Filter by Description

Only scrape articles about a specific topic in the description:

```json
{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 25,
  "descriptionContains": "startup funding"
}
```

### Example 8: Combine Multiple Filters

Search for AI articles with "OpenAI" in the title:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "arstechnica"],
  "maxArticlesPerSource": 30,
  "searchKeywords": ["AI", "ChatGPT"],
  "titleContains": "OpenAI"
}
```

## Input Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| usePresets | boolean | true | Use predefined sites or custom URLs |
| presetSources | array | ["verge", "techcrunch", "wired"] | List of preset site keys |
| customUrls | array | [] | Custom URLs to scrape (when usePresets is false) |
| maxArticlesPerSource | integer | 10 | Max articles per source (1-100) |
| maxPages | integer | 1 | Max listing pages to check (1-10) |
| includeContent | boolean | true | Extract full article text |
| searchKeywords | array | [] | Filter articles by keywords (matches ANY keyword) |
| titleContains | string | "" | Only scrape articles with this text in title |
| descriptionContains | string | "" | Only scrape articles with this text in description |

### Filtering Behavior

- searchKeywords: Articles matching ANY of the keywords in title OR description will be included
- titleContains: Only articles with this exact text (case-insensitive) in the title
- descriptionContains: Only articles with this exact text (case-insensitive) in the summary/description
- Combined filters: All specified filters must match (AND logic)
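As a concrete picture of these rules, the sketch below applies the same ANY-match and AND logic to a scraped article. It is illustrative only, not the actor's source code; the title and summary field names follow the output format shown in the next section.

```python
# Illustrative filter predicate matching the documented behavior.
def article_passes_filters(article: dict, search_keywords: list[str],
                           title_contains: str, description_contains: str) -> bool:
    title = (article.get("title") or "").lower()
    summary = (article.get("summary") or "").lower()

    # searchKeywords: keep the article if ANY keyword appears in title OR summary
    if search_keywords:
        if not any(kw.lower() in title or kw.lower() in summary for kw in search_keywords):
            return False

    # titleContains / descriptionContains: case-insensitive substring matches
    if title_contains and title_contains.lower() not in title:
        return False
    if description_contains and description_contains.lower() not in summary:
        return False

    # All specified filters matched (AND logic)
    return True
```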
"scraped_at": "2025-11-08T13:45:00+00:00" } ## Viewing Results After a successful run: 1. Dataset Tab: View all scraped articles in a table 2. Export Options: Download as JSON, CSV, XML, or Excel 3. API Access: Access results programmatically via Apify API 4. Schedule: Set up periodic scraping (hourly, daily, weekly) ## Best Practices 1. Start Small: Test with a few articles before scaling up 2. Respect Robots.txt: The scraper respects robots.txt automatically 3. Rate Limiting: Built-in delays prevent overwhelming servers 4. Monitor Logs: Check logs for warnings about missing content 5. Adjust Parameters: If scraping fails, try reducing maxArticlesPerSource ## Limitations - Some sites may use JavaScript-heavy rendering (not supported by this scraper) - Paywalled content cannot be extracted - Sites with anti-scraping measures may block requests - Content structure varies; generic extraction may miss some fields ## Troubleshooting ### No articles found - Check if the URL is correct and accessible in a browser - Verify the site allows scraping (check robots.txt) - Try increasing maxPages parameter - Some sites require JavaScript - this scraper uses static HTML only ### Missing content fields - Some sites have unique structures that the generic scraper might miss - This is normal - not all sites have all fields - The scraper uses multiple fallback strategies, but some data may be unavailable ### Connection timeouts - Check your internet connection - Some sites may be blocking automated requests - Try reducing maxArticlesPerSource to avoid overwhelming the target site ### Rate limiting errors - Reduce maxArticlesPerSource (try 5-10 instead of 50+) - Scrape fewer sources per run - Wait between runs ## FAQ Q: Can I scrape paywalled content? A: No, this scraper only accesses publicly available content. It respects the same limitations as a regular web browser. Q: How fast is the scraper? A: Approximately 1-2 articles per second, depending on site response time and content size. Built-in delays ensure polite scraping. Q: Can I scrape sites not in the preset list? A: Yes! Use the customUrls option to scrape any tech news site. Q: Will this work with JavaScript-heavy sites? A: This scraper uses static HTML parsing. For JavaScript-heavy sites (React, Vue, etc.), you may need a browser-based solution. Q: How do I schedule automated scraping? A: Use Apify's scheduler feature to run hourly, daily, or weekly. Q: Can I export data to a database? A: Yes, the scraper outputs JSON which can be easily imported into databases. On Apify, you can use integrations to automatically push data to various services. Q: Is this legal? A: Web scraping legality depends on the website's terms of service and your jurisdiction. Always: - Check the site's robots.txt - Review their Terms of Service - Don't overload their servers - Use data responsibly This tool is for educational and legitimate use cases only. ## License This project is provided as-is for educational and legitimate scraping purposes. Always respect website terms of service and robots.txt files.

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try article-scrapper now on Apify. Free tier available with no credit card required.


Actor Information

Developer: credible_sandal
Pricing: Paid
Total Runs: 48
Active Users: 10

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

Learn more about Apify
