Newsletter Scraper
by benthepythondev
Scrape newsletter archives from Substack, Beehiiv, and Ghost. Get clean markdown, metadata, images, and AI-ready token counts for content research and model training.
About Newsletter Scraper
Need to get your hands on the raw content from top newsletter platforms? This actor scrapes newsletter archives from Substack, Beehiiv, and Ghost, pulling everything you need for serious data work. It grabs the full text and delivers it in clean markdown, which is perfect for feeding into other systems. You also get all the metadata, any images embedded in the posts, and practical metrics like word and token counts. I use the token counts specifically to gauge costs before running batches through language models. It’s become a go-to for my own content research, letting me analyze trends and see what competitors are publishing without manual copying. For anyone building or training an AI model, having this structured, clean output is a huge time-saver over building a scraper from scratch. Just point it at the newsletter URL you need, and it handles the extraction, organizing the content in a way that’s immediately useful.
What does this actor do?
Newsletter Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
Newsletter Scraper
Extracts complete archives from Substack, Beehiiv, and Ghost newsletters. Outputs clean, structured data including post content, metadata, and engagement metrics, formatted for AI training or content analysis.
Overview
This actor scrapes newsletter platforms to provide full post histories. It outputs content in multiple formats (Markdown, HTML, plain text) alongside calculated metrics like word and token counts. It's designed for use cases like building LLM training datasets, content research, or competitive analysis.
Key Features
- Platform Support: Works with Substack, Beehiiv, and Ghost newsletters.
- Complete Data Extraction:
  - Full post content (title, body, author, publication date).
  - Newsletter metadata (name, description, author info).
  - Engagement metrics (likes, comments) where available.
  - Image data (URLs, alt text, dimensions).
  - Paywall/premium content detection.
- LLM-Optimized Output: Provides clean Markdown and estimates token counts (GPT-style).
- Flexible Scraping Modes: Choose to scrape the full archive, a single post, or just newsletter info.
- Rate-Limiting: Configurable delay between requests to respect target sites.
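The tokenizer behind the actor's token estimate is not documented here. As a rough illustration of what a GPT-style estimate involves, the sketch below uses the common rule of thumb of about four characters per token for English text. The helper names and the ratio are assumptions for illustration, not the actor's implementation.

```python
import math

# Assumption: ~4 characters per token, a common rule of thumb for English text.
# The actor's real estimator may differ (for example, it may run an actual tokenizer).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for `text` (hypothetical helper)."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())
```

A heuristic like this is only good for budgeting; for exact counts, run the tokenizer of the model you plan to use.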
How to Use
Input Configuration
- Newsletter URL: Provide the homepage URL of the target newsletter (e.g., https://newsletter.substack.com).
- Scrape Mode:
  - FULL_ARCHIVE: Gets the entire post history.
  - SINGLE_POST: Extracts one specific post URL.
  - NEWSLETTER_INFO: Fetches only metadata (name, description, author).
- Output Format: Choose MARKDOWN (recommended for AI), HTML, PLAIN_TEXT, or ALL.
- Optional Settings:
  - maxPosts: Limit the number of posts scraped (0 for no limit).
  - includeImages: Toggle image data extraction.
  - includeMetadata: Toggle engagement metrics.
  - postsSinceDate: Scrape only posts published after this date.
  - delayBetweenRequests: Set a pause (in seconds) between page requests.
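The options above can be assembled and sanity-checked before a run. The following is a minimal sketch: the field names and enum values come from this page, but the `build_input` helper, its defaults, and the ISO date format for postsSinceDate are assumptions.

```python
from datetime import date

SCRAPE_MODES = {"FULL_ARCHIVE", "SINGLE_POST", "NEWSLETTER_INFO"}
OUTPUT_FORMATS = {"MARKDOWN", "HTML", "PLAIN_TEXT", "ALL"}


def build_input(newsletter_url, scrape_mode="FULL_ARCHIVE", output_format="MARKDOWN",
                max_posts=0, include_images=True, include_metadata=True,
                posts_since=None, delay_seconds=None):
    """Build an input dict for the actor (hypothetical helper; defaults are assumed)."""
    if not newsletter_url.startswith(("http://", "https://")):
        raise ValueError("newsletter_url must start with http:// or https://")
    if scrape_mode not in SCRAPE_MODES:
        raise ValueError(f"unknown scrape mode: {scrape_mode}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    if max_posts < 0:
        raise ValueError("max_posts must be 0 (no limit) or a positive number")

    run_input = {
        "newsletterUrl": newsletter_url,
        "scrapeMode": scrape_mode,
        "outputFormat": output_format,
        "maxPosts": max_posts,
        "includeImages": include_images,
        "includeMetadata": include_metadata,
    }
    # Optional fields are only sent when set.
    if posts_since is not None:
        # Assumption: the actor accepts an ISO 8601 date string here.
        run_input["postsSinceDate"] = posts_since.isoformat()
    if delay_seconds is not None:
        run_input["delayBetweenRequests"] = delay_seconds
    return run_input
```

Failing early on a mistyped mode or format is cheaper than discovering it after a paid run has started.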
Execution
Run the actor with your configured input. It will:
1. Validate the provided URL.
2. Extract post links from the archive.
3. Scrape content and metadata from each post.
4. Process data into the chosen output format(s).
5. Save the results to the dataset.
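The five steps above can be sketched as a small pipeline. This is not the actor's source code: `run_pipeline` and the three callables passed into it are hypothetical stand-ins for the platform-specific fetching and conversion logic, which this page does not describe.

```python
import time
from typing import Callable, Dict, Iterable, List


def run_pipeline(newsletter_url: str,
                 list_post_urls: Callable[[str], Iterable[str]],
                 scrape_post: Callable[[str], Dict],
                 format_post: Callable[[Dict], Dict],
                 max_posts: int = 0,
                 delay_seconds: float = 0.0) -> List[Dict]:
    # 1. Validate the provided URL (a simple scheme check stands in for real validation).
    if not newsletter_url.startswith(("http://", "https://")):
        raise ValueError("newsletter URL must be absolute")

    # 2. Extract post links from the archive; 0 means no limit.
    post_urls = list(list_post_urls(newsletter_url))
    if max_posts > 0:
        post_urls = post_urls[:max_posts]

    dataset = []
    for index, url in enumerate(post_urls):
        # Pause between requests, mirroring delayBetweenRequests.
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        raw_post = scrape_post(url)        # 3. Scrape content and metadata.
        record = format_post(raw_post)     # 4. Convert to the chosen output format(s).
        dataset.append(record)             # 5. Save the result.
    return dataset


# Usage with stub callables in place of real network access.
posts = run_pipeline(
    "https://example.beehiiv.com",
    list_post_urls=lambda home: [f"{home}/p/a", f"{home}/p/b", f"{home}/p/c"],
    scrape_post=lambda url: {"url": url, "title": url.rsplit("/", 1)[-1]},
    format_post=lambda post: {**post, "content_markdown": f"# {post['title']}"},
    max_posts=2,
)
```

Passing the fetchers in as callables keeps the control flow testable without touching a live newsletter.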
Input/Output
Input Example (JSON):
{
  "newsletterUrl": "https://example.beehiiv.com",
  "scrapeMode": "FULL_ARCHIVE",
  "outputFormat": "MARKDOWN",
  "maxPosts": 50,
  "includeImages": true
}
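An input like the one above can also be submitted programmatically through the Apify REST API. The sketch below builds a request for the synchronous run-and-fetch endpoint using only the standard library. The actor ID shown is a guess at the slug (this page does not state it), and `build_run_request` and `run_actor` are illustrative helpers.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Hypothetical actor ID: the real "username~actor-name" slug is not given on this page.
ACTOR_ID = "benthepythondev~newsletter-scraper"


def build_run_request(run_input: dict, token: str,
                      actor_id: str = ACTOR_ID) -> urllib.request.Request:
    """Build (but do not send) a POST that runs the actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(run_input: dict, token: str) -> list:
    """Send the request and parse the JSON array of post objects."""
    with urllib.request.urlopen(build_run_request(run_input, token)) as response:
        return json.load(response)
```

The synchronous endpoint suits short runs such as SINGLE_POST or a small maxPosts; for a long archive, starting the run asynchronously and reading the dataset afterwards is likely a better fit.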
Output Structure:
The dataset contains an array of post objects. Each object includes:
| Field | Description | Example |
|---|---|---|
| title | Post title. | "10 Product Lessons" |
| url | Direct link to the post. | https://newsletter.com/p/post-slug |
| author | Post author name. | "Jane Doe" |
| published_date | ISO 8601 publication timestamp. | 2025-01-15T10:00:00Z |
| content_markdown | Main content in LLM-ready Markdown. | "# Header\n\nArticle text..." |
| content_html | Original HTML content. | "<article>...</article>" |
| content_text | Plain text version. | "Article text..." |
| word_count | Total word count. | 2450 |
| token_count | Estimated LLM tokens (~GPT-4). | 3200 |
| images | Array of image objects with url, width, height, alt. | [{...}] |
| is_premium | Paywall status. | false |
| metadata | Engagement statistics (platform-dependent). | {"likes": 245} |
Results are accessible via the Apify platform's Dataset tab in clean JSON format.
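Once the JSON is downloaded, the word_count, token_count, and is_premium fields are enough to budget an LLM batch before running it. The sketch below totals them over a list of post objects; the `summarize` helper and the per-token price are placeholders, so substitute the rate of the model you actually use.

```python
def summarize(posts, usd_per_million_tokens):
    """Total the word and token counts and estimate processing cost (hypothetical helper)."""
    total_words = sum(post.get("word_count", 0) for post in posts)
    total_tokens = sum(post.get("token_count", 0) for post in posts)
    premium_posts = sum(1 for post in posts if post.get("is_premium", False))
    return {
        "posts": len(posts),
        "premium_posts": premium_posts,
        "words": total_words,
        "tokens": total_tokens,
        # token_count is itself an estimate, so treat this as a ballpark figure.
        "estimated_cost_usd": total_tokens / 1_000_000 * usd_per_million_tokens,
    }


# Sample records shaped like the output table; the values are illustrative.
sample_posts = [
    {"title": "10 Product Lessons", "word_count": 2450, "token_count": 3200, "is_premium": False},
    {"title": "Members Only", "word_count": 1300, "token_count": 1800, "is_premium": True},
]
summary = summarize(sample_posts, usd_per_million_tokens=2.0)  # placeholder price
```

Counting premium posts separately is useful because paywalled entries may contain only a preview, which would skew averages.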
Common Use Cases
- Market Research: Gather competitive intelligence and market data.
- Lead Generation: Extract contact information for sales outreach.
- Price Monitoring: Track competitor pricing and product changes.
- Content Aggregation: Collect and organize content from multiple sources.
Actor Information
- Developer: benthepythondev
- Pricing: Paid
- Total Runs: 117
- Active Users: 9