RSS / XML Scraper
by shahidirfan
About RSS / XML Scraper
Meet the RSS / XML Scraper: an actor for parsing any RSS feed or XML file. It extracts clean, structured data from even the most complex sources, making it a practical tool for content aggregation, data monitoring, and content analysis.
What does this actor do?
RSS / XML Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
RSS/XML Scraper
Extract structured data from RSS/Atom feeds and websites with automatic feed discovery

🚀 Run on Apify • 📖 Documentation • 💬 Community

---

## 📋 What does this Actor do?

This Actor scrapes RSS/Atom feeds and extracts structured data from feed entries. It automatically discovers RSS feeds from websites and can optionally extract full article content. All extracted data is stored in the Apify dataset for easy processing and analysis.

### ✨ Key Features

- 📡 Feed Scraping: Extract data from RSS/Atom feeds
- 🔍 Auto Discovery: Find RSS feeds automatically from websites
- 📄 Full Content: Optional extraction of complete article content
- ⚡ Fast Processing: Asynchronous processing for high performance
- 🎯 Structured Data: Clean, structured output in JSON format
- 🔧 Flexible Input: Support for multiple URL formats and input methods

## 📥 Input

The Actor accepts various input formats to accommodate different use cases.

### Input Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `urls` | string or array | Required | - | RSS feed URLs, website URLs, or both. Supports a single URL (`"https://example.com/feed.xml"`), one URL per line, comma-separated values (`"url1,url2,url3"`), or a JSON array (`["url1", "url2"]`) |
| `extractContent` | boolean | Optional | `false` | Extract full article content from feed entry links |
| `maxEntries` | number | Optional | `0` | Maximum entries to process per feed (`0` = all entries) |
| `discoverFeeds` | boolean | Optional | `false` | Automatically discover RSS feeds from website URLs |
| `userAgent` | string | Optional | - | Custom user agent string for HTTP requests |
| `timeout` | number | Optional | `30` | Request timeout in seconds |
| `concurrency` | number | Optional | `5` | Maximum number of feeds/websites processed in parallel |

### Legacy Parameters (for backward compatibility)

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `rss_url` | string | Optional | Single RSS feed URL (alternative to `urls`) |
| `xml_url` | string | Optional | Single XML feed URL (alternative to `urls`) |

## 📤 Output

The Actor outputs structured JSON data to the Apify dataset. Data is available in multiple views for different analysis needs.
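The four accepted `urls` formats in the input table can be normalized into a flat list with a small helper. The function below is hypothetical (it is not part of the Actor's API); it simply mirrors the rules the table states:

```python
import json

def normalize_urls(urls):
    """Normalize the documented `urls` input formats into a flat list.

    Illustrative helper only: covers the four formats named in the input
    table (single URL, multi-line string, comma-separated string, JSON array).
    """
    if isinstance(urls, list):
        return [u.strip() for u in urls if u.strip()]
    text = urls.strip()
    # A JSON array passed as a string, e.g. '["url1", "url2"]'
    if text.startswith("["):
        return [u.strip() for u in json.loads(text)]
    # Multi-line and/or comma-separated: split on both delimiters
    parts = []
    for line in text.splitlines():
        parts.extend(p.strip() for p in line.split(","))
    return [p for p in parts if p]

print(normalize_urls("https://a.com/feed.xml,https://b.com/rss"))
# → ['https://a.com/feed.xml', 'https://b.com/rss']
```

All four formats converge on the same list shape, which is presumably why the Actor can accept them interchangeably.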
### Data Structure

Each processed entry contains the following fields:

```json
{
  "feed_url": "https://example.com/feed.xml",
  "title": "Article Title",
  "link": "https://example.com/article",
  "description": "Article description or summary",
  "author": "John Doe",
  "published": "2025-11-08T10:30:00+00:00",
  "id": "unique-entry-identifier",
  "tags": ["tag1", "tag2"],
  "collected_at": "2025-11-08T12:00:00+00:00"
}
```

### Additional Fields (when `extractContent: true`)

```json
{
  "full_text": "Complete article text content...",
  "full_html": "<p>Complete article HTML...</p>",
  "keywords": ["keyword1", "keyword2"],
  "top_image": "https://example.com/image.jpg",
  "authors": ["Author Name"],
  "publish_date": "2025-11-08T10:30:00+00:00",
  "meta_description": "Article meta description"
}
```

### Dataset Views

The dataset provides multiple views for different analysis needs:

- 📊 Overview: Complete entry data with all fields
- 📰 Feeds: Feed-level information and metadata
- 📝 Articles: Article content and extracted data

## 🚀 Usage Examples

### Basic RSS Feed Scraping

```json
{ "urls": "https://example.com/feed.xml" }
```

### Multiple Feeds

```json
{
  "urls": [
    "https://blog1.com/feed.xml",
    "https://blog2.com/rss",
    "https://news.com/atom.xml"
  ]
}
```

### Website Feed Discovery

```json
{
  "urls": "https://example.com",
  "discoverFeeds": true
}
```

### Full Content Extraction

```json
{
  "urls": "https://tech-news.com/feed.xml",
  "extractContent": true,
  "maxEntries": 50
}
```

### Advanced Configuration

```json
{
  "urls": "https://example.com/feed.xml",
  "extractContent": true,
  "maxEntries": 100,
  "discoverFeeds": false,
  "userAgent": "Custom Bot/1.0",
  "timeout": 60
}
```

### Legacy Input Format

```json
{
  "rss_url": "https://example.com/feed.xml",
  "extractContent": true
}
```

## 💰 Cost & Performance

### Compute Units

- Free: 1,000 entries per month
- Paid: $0.25 per 1,000 entries

### Performance

- Typical Speed: 100-500 entries per minute
- Concurrent Processing: Multiple feeds processed simultaneously
- Memory Usage: ~50 MB base + ~10 MB per active feed
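For local prototyping against the output schema documented above, a minimal RSS 2.0 parser built on Python's standard library can emit dicts with the same field names. This is an illustrative sketch, not the Actor's implementation (the real Actor also handles Atom, XML namespaces, and full-content extraction):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def parse_rss_entries(xml_text, feed_url):
    """Parse <item> elements from an RSS 2.0 document into dicts shaped
    like the Actor's documented output. Sketch only; no Atom support."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        def get(tag, item=item):
            return (item.findtext(tag) or "").strip()
        entries.append({
            "feed_url": feed_url,
            "title": get("title"),
            "link": get("link"),
            "description": get("description"),
            "author": get("author"),
            "published": get("pubDate"),
            # RSS uses <guid>; fall back to the link when it is absent
            "id": get("guid") or get("link"),
            "tags": [c.text for c in item.findall("category") if c.text],
            "collected_at": datetime.now(timezone.utc).isoformat(),
        })
    return entries

SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Hello</title><link>https://example.com/a</link>
<guid>a-1</guid><category>news</category></item>
</channel></rss>"""

entries = parse_rss_entries(SAMPLE, "https://example.com/feed.xml")
print(entries[0]["title"], entries[0]["id"])  # Hello a-1
```

Fields the sketch cannot recover from a bare feed (such as `full_text` or `top_image`) correspond to the `extractContent: true` stage, which fetches each entry's link separately.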
## ⚠️ Limits & Quotas

- Maximum URLs: 100 URLs per run
- Maximum Entries: 10,000 entries per feed (configurable)
- Request Timeout: 300 seconds maximum
- Rate Limiting: Automatic handling of rate limits
- File Size: No limit on extracted content

## 🛠️ Troubleshooting

### Common Issues

"No feeds found"
- Check if the URL is accessible
- Verify the URL points to a valid RSS/Atom feed
- Use `discoverFeeds: true` for website URLs

"Content extraction failed"
- Some websites block automated access
- Try with a custom `userAgent`
- Check if the article URL is still valid

"Timeout errors"
- Increase the `timeout` parameter
- Reduce `maxEntries` for large feeds
- Check network connectivity

### Error Handling

The Actor automatically handles:

- Network timeouts and retries
- Invalid URLs and feeds
- Malformed content
- Rate limiting from websites

## 📚 Resources

- 📖 Apify Platform Documentation
- 🎯 Apify Console
- 💬 Community Forum
- 🆘 Support

---

Built with ❤️ for the Apify platform

Extract RSS data effortlessly • Automate content aggregation • Power your applications
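The troubleshooting advice for "No feeds found" suggests enabling `discoverFeeds` for website URLs. The Actor's discovery logic is not published, but auto-discovery conventionally works by scanning a page's `<head>` for `<link rel="alternate">` tags with a feed MIME type; a stdlib sketch of that standard technique:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Feed MIME types conventionally advertised in <link rel="alternate"> tags
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised in a page's <link> tags.

    Illustrative sketch of standard feed auto-discovery, not the
    Actor's actual implementation.
    """
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            # Resolve relative hrefs like "/feed.xml" against the page URL
            self.feeds.append(urljoin(self.base_url, a.get("href", "")))

HTML = ('<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
        '</head><body></body></html>')
finder = FeedLinkFinder("https://example.com")
finder.feed(HTML)
print(finder.feeds)  # ['https://example.com/feed.xml']
```

Pages that do not advertise feeds this way will legitimately yield "No feeds found", which is why the troubleshooting section also asks you to verify the URL directly.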
Categories
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try RSS / XML Scraper now on Apify. Free tier available with no credit card required.
Start Free Trial
Actor Information
- Developer
- shahidirfan
- Pricing
- Paid
- Total Runs
- 63
- Active Users
- 11
Related Actors
Web Scraper
by apify
Cheerio Scraper
by apify
Website Content Crawler
by apify
Legacy PhantomJS Crawler
by apify
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
Learn more about Apify