RSS / XML Scraper
by shahidirfan
About RSS / XML Scraper
Meet the RSS / XML Scraper: an actor for parsing any RSS feed or XML file. It extracts clean, structured data from even the most complex sources, making it a practical tool for content aggregation, data monitoring, and content analysis.
What does this actor do?
RSS / XML Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
RSS/XML Scraper
Extract structured data from RSS/Atom feeds and websites with automatic feed discovery

🚀 Run on Apify • 📖 Documentation • 💬 Community

---

## 📋 What does this Actor do?

This Actor scrapes RSS/Atom feeds and extracts structured data from feed entries. It automatically discovers RSS feeds from websites and can optionally extract full article content. All extracted data is stored in the Apify dataset for easy processing and analysis.

### ✨ Key Features

- 📡 Feed Scraping: Extract data from RSS/Atom feeds
- 🔍 Auto Discovery: Find RSS feeds automatically from websites
- 📄 Full Content: Optional extraction of complete article content
- ⚡ Fast Processing: Asynchronous processing for high performance
- 🎯 Structured Data: Clean, structured output in JSON format
- 🔧 Flexible Input: Support for multiple URL formats and input methods

## 📥 Input

The Actor accepts various input formats to accommodate different use cases.

### Input Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `urls` | string or array | Required | - | RSS feed URLs, website URLs, or both. Supports a single URL (`"https://example.com/feed.xml"`), one URL per line, comma-separated values (`"url1,url2,url3"`), or a JSON array (`["url1", "url2"]`) |
| `extractContent` | boolean | Optional | `false` | Extract full article content from feed entry links |
| `maxEntries` | number | Optional | `0` | Maximum entries to process per feed (`0` = all entries) |
| `discoverFeeds` | boolean | Optional | `false` | Automatically discover RSS feeds from website URLs |
| `userAgent` | string | Optional | - | Custom user agent string for HTTP requests |
| `timeout` | number | Optional | `30` | Request timeout in seconds |
| `concurrency` | number | Optional | `5` | Maximum number of feeds/websites processed in parallel |

### Legacy Parameters (for backward compatibility)

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `rss_url` | string | Optional | Single RSS feed URL (alternative to `urls`) |
| `xml_url` | string | Optional | Single XML feed URL (alternative to `urls`) |

## 📤 Output

The Actor outputs structured JSON data to the Apify dataset. Data is available in multiple views for different analysis needs.
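The four accepted `urls` formats in the input table can be normalized into a flat list with a small helper. The function below is hypothetical (it is not part of the Actor's API); it simply mirrors the rules the table states:

```python
import json

def normalize_urls(urls):
    """Normalize the documented `urls` input formats into a flat list.

    Illustrative helper only: covers the four formats named in the input
    table (single URL, multi-line string, comma-separated string, JSON array).
    """
    if isinstance(urls, list):
        return [u.strip() for u in urls if u.strip()]
    text = urls.strip()
    # A JSON array passed as a string, e.g. '["url1", "url2"]'
    if text.startswith("["):
        return [u.strip() for u in json.loads(text)]
    # Multi-line and/or comma-separated: split on both delimiters
    parts = []
    for line in text.splitlines():
        parts.extend(p.strip() for p in line.split(","))
    return [p for p in parts if p]

print(normalize_urls("https://a.com/feed.xml,https://b.com/rss"))
# → ['https://a.com/feed.xml', 'https://b.com/rss']
```

All four formats converge on the same list shape, which is presumably why the Actor can accept them interchangeably.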
### Data Structure

Each processed entry contains the following fields:

```json
{
  "feed_url": "https://example.com/feed.xml",
  "title": "Article Title",
  "link": "https://example.com/article",
  "description": "Article description or summary",
  "author": "John Doe",
  "published": "2025-11-08T10:30:00+00:00",
  "id": "unique-entry-identifier",
  "tags": ["tag1", "tag2"],
  "collected_at": "2025-11-08T12:00:00+00:00"
}
```

### Additional Fields (when `extractContent: true`)

```json
{
  "full_text": "Complete article text content...",
  "full_html": "<p>Complete article HTML...</p>",
  "keywords": ["keyword1", "keyword2"],
  "top_image": "https://example.com/image.jpg",
  "authors": ["Author Name"],
  "publish_date": "2025-11-08T10:30:00+00:00",
  "meta_description": "Article meta description"
}
```

### Dataset Views

The dataset provides multiple views for different analysis needs:

- 📊 Overview: Complete entry data with all fields
- 📰 Feeds: Feed-level information and metadata
- 📝 Articles: Article content and extracted data

## 🚀 Usage Examples

### Basic RSS Feed Scraping

```json
{ "urls": "https://example.com/feed.xml" }
```

### Multiple Feeds

```json
{
  "urls": [
    "https://blog1.com/feed.xml",
    "https://blog2.com/rss",
    "https://news.com/atom.xml"
  ]
}
```

### Website Feed Discovery

```json
{
  "urls": "https://example.com",
  "discoverFeeds": true
}
```

### Full Content Extraction

```json
{
  "urls": "https://tech-news.com/feed.xml",
  "extractContent": true,
  "maxEntries": 50
}
```

### Advanced Configuration

```json
{
  "urls": "https://example.com/feed.xml",
  "extractContent": true,
  "maxEntries": 100,
  "discoverFeeds": false,
  "userAgent": "Custom Bot/1.0",
  "timeout": 60
}
```

### Legacy Input Format

```json
{
  "rss_url": "https://example.com/feed.xml",
  "extractContent": true
}
```

## 💰 Cost & Performance

### Compute Units

- Free: 1,000 entries per month
- Paid: $0.25 per 1,000 entries

### Performance

- Typical Speed: 100-500 entries per minute
- Concurrent Processing: Multiple feeds processed simultaneously
- Memory Usage: ~50 MB base + ~10 MB per active feed
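For local prototyping against the output schema documented above, a minimal RSS 2.0 parser built on Python's standard library can emit dicts with the same field names. This is an illustrative sketch, not the Actor's implementation (the real Actor also handles Atom, XML namespaces, and full-content extraction):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def parse_rss_entries(xml_text, feed_url):
    """Parse <item> elements from an RSS 2.0 document into dicts shaped
    like the Actor's documented output. Sketch only; no Atom support."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        def get(tag, item=item):
            return (item.findtext(tag) or "").strip()
        entries.append({
            "feed_url": feed_url,
            "title": get("title"),
            "link": get("link"),
            "description": get("description"),
            "author": get("author"),
            "published": get("pubDate"),
            # RSS uses <guid>; fall back to the link when it is absent
            "id": get("guid") or get("link"),
            "tags": [c.text for c in item.findall("category") if c.text],
            "collected_at": datetime.now(timezone.utc).isoformat(),
        })
    return entries

SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Hello</title><link>https://example.com/a</link>
<guid>a-1</guid><category>news</category></item>
</channel></rss>"""

entries = parse_rss_entries(SAMPLE, "https://example.com/feed.xml")
print(entries[0]["title"], entries[0]["id"])  # Hello a-1
```

Fields the sketch cannot recover from a bare feed (such as `full_text` or `top_image`) correspond to the `extractContent: true` stage, which fetches each entry's link separately.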
## ⚠️ Limits & Quotas

- Maximum URLs: 100 URLs per run
- Maximum Entries: 10,000 entries per feed (configurable)
- Request Timeout: 300 seconds maximum
- Rate Limiting: Automatic handling of rate limits
- File Size: No limit on extracted content

## 🛠️ Troubleshooting

### Common Issues

"No feeds found"
- Check if the URL is accessible
- Verify the URL points to a valid RSS/Atom feed
- Use `discoverFeeds: true` for website URLs

"Content extraction failed"
- Some websites block automated access
- Try with a custom `userAgent`
- Check if the article URL is still valid

"Timeout errors"
- Increase the `timeout` parameter
- Reduce `maxEntries` for large feeds
- Check network connectivity

### Error Handling

The Actor automatically handles:

- Network timeouts and retries
- Invalid URLs and feeds
- Malformed content
- Rate limiting from websites

## 📚 Resources

- 📖 Apify Platform Documentation
- 🎯 Apify Console
- 💬 Community Forum
- 🆘 Support

---

Built with ❤️ for the Apify platform

Extract RSS data effortlessly • Automate content aggregation • Power your applications
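The troubleshooting advice for "No feeds found" suggests enabling `discoverFeeds` for website URLs. The Actor's discovery logic is not published, but auto-discovery conventionally works by scanning a page's `<head>` for `<link rel="alternate">` tags with a feed MIME type; a stdlib sketch of that standard technique:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Feed MIME types conventionally advertised in <link rel="alternate"> tags
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised in a page's <link> tags.

    Illustrative sketch of standard feed auto-discovery, not the
    Actor's actual implementation.
    """
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            # Resolve relative hrefs like "/feed.xml" against the page URL
            self.feeds.append(urljoin(self.base_url, a.get("href", "")))

HTML = ('<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
        '</head><body></body></html>')
finder = FeedLinkFinder("https://example.com")
finder.feed(HTML)
print(finder.feeds)  # ['https://example.com/feed.xml']
```

Pages that do not advertise feeds this way will legitimately yield "No feeds found", which is why the troubleshooting section also asks you to verify the URL directly.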
Categories
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try RSS / XML Scraper now on Apify. Free tier available with no credit card required.
Start Free Trial
Actor Information
- Developer
- shahidirfan
- Pricing
- Paid
- Total Runs
- 63
- Active Users
- 11
Related Actors
Web Scraper
by apify
Cheerio Scraper
by apify
Website Content Crawler
by apify
Legacy PhantomJS Crawler
by apify
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
Learn more about Apify