Reddit Scraper
by crawlerbros
Scrape entire subreddits with this crawler. Returns the posts in a subreddit along with their title, text, score, timestamp, and other metadata.
What does this actor do?
Reddit Scraper is a web scraping tool that runs in the cloud on the Apify platform. It extracts posts from public subreddits and returns them as structured data.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
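Once the results are downloaded, some light post-processing is often the next step. The sketch below assumes the dataset was exported as a JSON array and that each item carries the `score`, `is_stickied`, and `created_utc` fields described in the Documentation section. The helper names (`load_posts`, `top_posts`, `created_iso`) are hypothetical and not part of the actor.

```python
import json
from datetime import datetime, timezone


def load_posts(raw_json: str) -> list:
    """Parse a JSON array of post records (assumed export format)."""
    posts = json.loads(raw_json)
    if not isinstance(posts, list):
        raise ValueError("expected a JSON array of post objects")
    return posts


def top_posts(posts, min_score=0, include_stickied=False):
    """Filter out low-score (and optionally stickied) posts, highest score first."""
    kept = [
        p for p in posts
        if p.get("score", 0) >= min_score
        and (include_stickied or not p.get("is_stickied", False))
    ]
    return sorted(kept, key=lambda p: p.get("score", 0), reverse=True)


def created_iso(post) -> str:
    """Render created_utc (Unix seconds) as an ISO 8601 UTC string without offset."""
    dt = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return dt.replace(tzinfo=None).isoformat()
```

For the example record in the documentation, `created_iso` applied to `created_utc = 1747683628` yields `2025-05-19T19:40:28`, which matches the documented `created_at` value.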
Documentation
# Reddit Subreddit Scraper

An Apify Actor for scraping posts from Reddit subreddits using browser automation with Playwright.

## Features

- 🎯 Scrape multiple subreddits in a single run
- 📊 Extract comprehensive post data (title, author, score, comments, etc.)
- 🔄 Support for different sorting methods (hot, new, top, rising, controversial)
- ⏰ Time filters for "top" and "controversial" posts
- 📦 No authentication required for public subreddits
- 💾 Data saved in structured JSON format
- 🌐 Browser automation bypasses API restrictions
- 🔄 Automatic pagination support

## Input Parameters

The actor accepts the following input parameters:

| Parameter    | Type    | Required | Default      | Description                                                              |
| ------------ | ------- | -------- | ------------ | ------------------------------------------------------------------------ |
| `subreddits` | array   | Yes      | `["python"]` | List of subreddit names to scrape (without 'r/' prefix)                  |
| `maxPosts`   | integer | No       | `25`         | Maximum number of posts to scrape from each subreddit (1-1000)           |
| `sort`       | string  | No       | `"hot"`      | How to sort posts: hot, new, top, rising, or controversial               |
| `timeFilter` | string  | No       | `"day"`      | Time filter for 'top'/'controversial': hour, day, week, month, year, all |

### Example Input

```json
{
  "subreddits": ["islamabad", "pakistan", "programming"],
  "maxPosts": 50,
  "sort": "hot",
  "timeFilter": "day"
}
```

## Output Fields

The actor extracts the following data for each post:

### Subreddit Information

- `subreddit` - Subreddit name (e.g., "islamabad")
- `subreddit_prefixed` - Subreddit name with r/ prefix (e.g., "r/islamabad")

### Post Content

- `post_id` - Unique post ID (e.g., "1kql1t5")
- `post_name` - Full post name in Reddit format (e.g., "t3_1kql1t5")
- `title` - Post title
- `author` - Username of the post author
- `selftext` - Text content preview (first 1000 chars, for self posts only)

### Engagement Metrics

- `score` - Post score/karma (upvotes minus downvotes)
- `num_comments` - Number of comments on the post

### Links

- `url` - URL
  of the linked content (external URL or permalink for self posts)
- `permalink` - Direct link to the Reddit post

### Metadata

- `domain` - Domain of the linked content (e.g., "self.islamabad" for text posts)
- `is_self_post` - Boolean indicating if it's a text post (true) or link post (false)
- `link_flair` - Post flair/tag text (if any)
- `thumbnail_url` - URL of the post thumbnail image (if any)

### Timestamps

- `created_utc` - Unix timestamp when the post was created
- `created_at` - ISO 8601 formatted datetime (e.g., "2025-05-19T19:40:28")

### Flags

- `is_stickied` - Boolean indicating if the post is stickied/pinned
- `is_locked` - Boolean indicating if the post is locked (no new comments)
- `is_nsfw` - Boolean indicating if the post is marked as NSFW (over 18)

### Example Output

```json
{
  "subreddit": "islamabad",
  "subreddit_prefixed": "r/islamabad",
  "post_id": "1kql1t5",
  "post_name": "t3_1kql1t5",
  "title": "Everyone's always asking what to do in Islamabad - I made a list",
  "author": "hafmaestro",
  "selftext": "Note: I have not mentioned normal restaurants and cafes...",
  "score": 595,
  "num_comments": 101,
  "url": "https://old.reddit.com/r/islamabad/comments/1kql1t5/...",
  "permalink": "https://old.reddit.com/r/islamabad/comments/1kql1t5/...",
  "domain": "self.islamabad",
  "is_self_post": true,
  "link_flair": "Islamabad",
  "thumbnail_url": null,
  "created_utc": 1747683628,
  "created_at": "2025-05-19T19:40:28",
  "is_stickied": false,
  "is_locked": false,
  "is_nsfw": false
}
```

## Usage

### Local Development

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   playwright install chromium
   ```

2. Set up input in `storage/key_value_stores/default/INPUT.json`:

   ```json
   {
     "subreddits": ["python"],
     "maxPosts": 25,
     "sort": "hot"
   }
   ```

3. Run the actor:

   ```bash
   python -m src
   ```

4. Check results in `storage/datasets/default/`

### On Apify Platform

1. Push to Apify:
   - Login to Apify CLI: `apify login`
   - Initialize: `apify init` (if not already done)
   - Push to Apify: `apify push`
2. Or manually upload:
   - Create a new actor on Apify platform
   - Upload all files including `Dockerfile`, `requirements.txt`, and `.actor/` directory
3. Configure and run:
   - Set input parameters in the Apify console
   - Click "Start" to run the actor
   - Download results from the dataset tab

## Technical Details

### Browser Automation

- Uses Playwright with Chromium browser
- Scrapes old.reddit.com for better compatibility and simpler HTML structure
- Implements anti-detection measures:
  - Custom User-Agent headers
  - Disabled automation flags
  - Browser fingerprint masking

### Features

- Automatic pagination: Clicks "next" button to load more posts
- Smart selectors: Multiple fallback CSS selectors for reliability
- Error handling: Screenshots saved on errors for debugging
- Rate limiting: Built-in delays between requests

### Performance

- Headless browser mode for efficiency
- Optimized page load strategy (`domcontentloaded`)
- Configurable wait times and timeouts

## Limitations

- Only works with public subreddits
- Cannot scrape private or restricted communities
- Browser automation is slower than direct API calls but more reliable
- Selftext preview limited to first 1000 characters

## Dependencies

- `apify>=2.1.0` - Apify SDK for Python
- `playwright~=1.40.0` - Browser automation framework
- `beautifulsoup4~=4.12.0` - HTML parsing library

## Troubleshooting

### Timeout Issues

If you encounter timeout errors:

- Check the debug screenshots in the key-value store
- Increase timeout values in the code
- Verify the subreddit exists and is public

### No Posts Found

- Verify the subreddit name is correct (without 'r/' prefix)
- Check if the subreddit has posts for the selected sort method
- Review logs for detailed error messages

## License

This actor is provided as-is for scraping public Reddit data in accordance with Reddit's terms of service.
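Several of the troubleshooting checks above concern the input itself (a stray 'r/' prefix, an unsupported sort value). The following is a minimal sketch of a pre-flight check, assuming the parameter names, defaults, and allowed values from the Input Parameters table. `normalize_subreddit` and `build_input` are hypothetical helpers and are not part of the actor's code.

```python
from typing import Any, Dict, List

# Allowed values as listed in the Input Parameters table.
ALLOWED_SORTS = {"hot", "new", "top", "rising", "controversial"}
ALLOWED_TIME_FILTERS = {"hour", "day", "week", "month", "year", "all"}


def normalize_subreddit(name: str) -> str:
    """Strip whitespace, surrounding slashes, and a leading 'r/' prefix."""
    cleaned = name.strip().strip("/")
    if cleaned.lower().startswith("r/"):
        cleaned = cleaned[2:]
    if not cleaned:
        raise ValueError(f"empty subreddit name: {name!r}")
    return cleaned


def build_input(subreddits: List[str], max_posts: int = 25,
                sort: str = "hot", time_filter: str = "day") -> Dict[str, Any]:
    """Validate the arguments and return a dict shaped like the actor input."""
    if not subreddits:
        raise ValueError("at least one subreddit is required")
    if isinstance(max_posts, bool) or not isinstance(max_posts, int):
        raise ValueError("maxPosts must be an integer")
    if not 1 <= max_posts <= 1000:
        raise ValueError("maxPosts must be between 1 and 1000")
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort: {sort!r}")
    if time_filter not in ALLOWED_TIME_FILTERS:
        raise ValueError(f"unsupported timeFilter: {time_filter!r}")
    return {
        "subreddits": [normalize_subreddit(s) for s in subreddits],
        "maxPosts": max_posts,
        "sort": sort,
        "timeFilter": time_filter,
    }
```

The returned dictionary can be serialized with `json.dumps` and saved as `INPUT.json` for a local run, or pasted into the Apify console.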
## Notes

- This scraper uses browser automation to access Reddit's public web interface
- Always respect Reddit's robots.txt and terms of service
- Use responsibly and avoid overwhelming Reddit's servers
- Consider implementing additional rate limiting for large-scale scraping
- The actor works best with the Apify platform's infrastructure
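One way to act on the rate-limiting note is to enforce a minimum interval between successive runs or requests on the caller's side. The sketch below is an illustration under that assumption and does not describe how the actor's built-in delays work. `MinIntervalLimiter` is a hypothetical name, and the injectable `clock` and `sleep` arguments exist only to make the class easy to test.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if min_interval < 0:
            raise ValueError("min_interval must be non-negative")
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep for the remaining interval if needed; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `limiter.wait()` before each subreddit run keeps successive runs at least `min_interval` seconds apart. The first call never sleeps.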
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Actor Information
- Developer: crawlerbros
- Pricing: Paid
- Total Runs: 532
- Active Users: 86