Newsletter Scraper

Name: Newsletter Scraper
Author: benthepythondev

Scrape newsletter archives from Substack, Beehiiv, and Ghost. Get clean markdown, metadata, images, and AI-ready token counts for content research and model training.

About Newsletter Scraper

Need to get your hands on the raw content from top newsletter platforms? This actor scrapes newsletter archives from Substack, Beehiiv, and Ghost, pulling everything you need for serious data work. It grabs the full text and delivers it in clean markdown, which is perfect for feeding into other systems. You also get all the metadata, any images embedded in the posts, and practical metrics like word and token counts. I use the token counts specifically to gauge costs before running batches through language models. It’s become a go-to for my own content research, letting me analyze trends and see what competitors are publishing without manual copying. For anyone building or training an AI model, having this structured, clean output is a huge time-saver over building a scraper from scratch. Just point it at the newsletter URL you need, and it handles the extraction, organizing the content in a way that’s immediately useful.

What does this actor do?

Newsletter Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

1. Click "Try This Actor" to open it on Apify.
2. Create a free Apify account if you don't have one.
3. Configure the input parameters as needed.
4. Run the actor and download your results.

Documentation

Newsletter Scraper

Extracts complete archives from Substack, Beehiiv, and Ghost newsletters. Outputs clean, structured data including post content, metadata, and engagement metrics, formatted for AI training or content analysis.

Overview

This actor scrapes newsletter platforms to provide full post histories. It outputs content in multiple formats (Markdown, HTML, plain text) alongside calculated metrics like word and token counts. It's designed for use cases like building LLM training datasets, content research, or competitive analysis.

Key Features

Platform Support: Works with Substack, Beehiiv, and Ghost newsletters.
Complete Data Extraction:
- Full post content (title, body, author, publication date).
- Newsletter metadata (name, description, author info).
- Engagement metrics (likes, comments) where available.
- Image data (URLs, alt text, dimensions).
- Paywall/premium content detection.
LLM-Optimized Output: Provides clean Markdown and estimates token counts (GPT-style).
Flexible Scraping Modes: Choose to scrape the full archive, a single post, or just newsletter info.
Rate-Limiting: Configurable delay between requests to respect target sites.
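
The actor's exact token-counting method is not documented here. As a rough illustration of what a GPT-style estimate can look like, the sketch below uses the common rule of thumb of about four characters of English text per token. The function name `estimate_tokens` and the `chars_per_token` parameter are hypothetical, and the result will differ from what a real tokenizer reports.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed ratio.

    This is a heuristic, not the actor's implementation and not a tokenizer.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    sample = "# Header\n\nArticle text..."
    print(estimate_tokens(sample))
```

A heuristic like this is only useful for ballpark budgeting; for billing-accurate numbers, count tokens with the tokenizer of the model you plan to use.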

How to Use

Input Configuration

Newsletter URL (`newsletterUrl`): Provide the homepage URL of the target newsletter (e.g., https://newsletter.substack.com).
Scrape Mode (`scrapeMode`):
- FULL_ARCHIVE: Gets the entire post history.
- SINGLE_POST: Extracts a single post from its URL.
- NEWSLETTER_INFO: Fetches only metadata (name, description, author).
Output Format (`outputFormat`): Choose MARKDOWN (recommended for AI), HTML, PLAIN_TEXT, or ALL.
Optional Settings:
- maxPosts: Limit the number of posts scraped (0 for no limit).
- includeImages: Toggle image data extraction.
- includeMetadata: Toggle engagement metrics.
- postsSinceDate: Scrape only posts published after this date.
- delayBetweenRequests: Set a pause (in seconds) between page requests.
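
To catch typos before a run starts, the input can be assembled and checked in code. The following is a minimal sketch, assuming the field names shown in this documentation; the helper `build_actor_input` and its default values are hypothetical and are not part of the actor.

```python
# Allowed values as listed in the Input Configuration section.
SCRAPE_MODES = {"FULL_ARCHIVE", "SINGLE_POST", "NEWSLETTER_INFO"}
OUTPUT_FORMATS = {"MARKDOWN", "HTML", "PLAIN_TEXT", "ALL"}
KNOWN_OPTIONS = {
    "maxPosts",
    "includeImages",
    "includeMetadata",
    "postsSinceDate",
    "delayBetweenRequests",
}


def build_actor_input(newsletter_url, scrape_mode="FULL_ARCHIVE",
                      output_format="MARKDOWN", **options):
    """Return an input dict, rejecting unknown modes, formats, and options.

    The defaults here are assumptions made for this sketch.
    """
    if scrape_mode not in SCRAPE_MODES:
        raise ValueError(f"Unknown scrapeMode: {scrape_mode!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"Unknown outputFormat: {output_format!r}")
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unknown options: {sorted(unknown)}")
    if options.get("maxPosts", 0) < 0:
        raise ValueError("maxPosts must be 0 (no limit) or a positive number")
    return {
        "newsletterUrl": newsletter_url,
        "scrapeMode": scrape_mode,
        "outputFormat": output_format,
        **options,
    }


if __name__ == "__main__":
    print(build_actor_input("https://example.beehiiv.com", maxPosts=50))
```

Only options that are passed explicitly end up in the dict, so the actor's own defaults apply to everything else.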

Execution

Run the actor with your configured input. It will:
1. Validate the provided URL.
2. Extract post links from the archive.
3. Scrape content and metadata from each post.
4. Process data into the chosen output format(s).
5. Save the results to the dataset.
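
The steps above can be sketched in a few lines. This is an illustration of the general pattern (validate, iterate, pause between requests), not the actor's source code; `is_valid_newsletter_url`, `scrape_archive`, and the caller-supplied `fetch_post` function are all hypothetical names.

```python
import time
from urllib.parse import urlparse


def is_valid_newsletter_url(url: str) -> bool:
    """Step 1: accept only absolute http(s) URLs that include a hostname."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def scrape_archive(post_urls, fetch_post, delay_seconds=1.0, max_posts=0,
                   sleep=time.sleep):
    """Steps 3-4: fetch each post in order, pausing between requests.

    fetch_post is supplied by the caller and returns one post record.
    max_posts=0 means no limit, mirroring the maxPosts setting.
    """
    results = []
    for index, url in enumerate(post_urls):
        if max_posts and index >= max_posts:
            break
        if index > 0:
            sleep(delay_seconds)  # rate limiting between page requests
        results.append(fetch_post(url))
    return results


if __name__ == "__main__":
    urls = ["https://example.com/p/a", "https://example.com/p/b"]
    posts = scrape_archive(urls, lambda u: {"url": u}, delay_seconds=0.1)
    print(len(posts))
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting.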

Input/Output

Input Example (JSON):

{
  "newsletterUrl": "https://example.beehiiv.com",
  "scrapeMode": "FULL_ARCHIVE",
  "outputFormat": "MARKDOWN",
  "maxPosts": 50,
  "includeImages": true
}
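
An input like the one above can also be submitted through the Apify REST API instead of the web console. The sketch below uses only the Python standard library and the API's synchronous "run and return dataset items" endpoint. The actor ID `benthepythondev/newsletter-scraper` is an assumption (check the actor's page for the real ID), and `build_run_request` and `run_actor` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build a POST request that runs an actor and returns its dataset items."""
    # In API paths, "username/actor-name" is written as "username~actor-name".
    path_id = urllib.parse.quote(actor_id.replace("/", "~"), safe="~")
    url = f"{API_BASE}/acts/{path_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id, token, run_input, timeout=300):
    """Send the request and return the parsed list of dataset items."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumed actor ID and placeholder token; replace both before running.
    example_input = {
        "newsletterUrl": "https://example.beehiiv.com",
        "scrapeMode": "FULL_ARCHIVE",
        "outputFormat": "MARKDOWN",
        "maxPosts": 50,
        "includeImages": True,
    }
    posts = run_actor("benthepythondev/newsletter-scraper",
                      "YOUR_APIFY_TOKEN", example_input)
    print(len(posts))
```

A synchronous call suits short runs such as a single post or a small `maxPosts` value; a large archive may take longer than a synchronous request allows, in which case starting the run asynchronously and reading the dataset afterwards is likely the better fit.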

Output Structure:
The dataset contains an array of post objects. Each object includes:

| Field | Description | Example |
| --- | --- | --- |
| `title` | Post title. | `"10 Product Lessons"` |
| `url` | Direct link to the post. | `"https://newsletter.com/p/post-slug"` |
| `author` | Post author name. | `"Jane Doe"` |
| `published_date` | ISO 8601 publication timestamp. | `"2025-01-15T10:00:00Z"` |
| `content_markdown` | Main content in LLM-ready Markdown. | `"# Header\n\nArticle text..."` |
| `content_html` | Original HTML content. | `"<article>...</article>"` |
| `content_text` | Plain text version. | `"Article text..."` |
| `word_count` | Total word count. | `2450` |
| `token_count` | Estimated LLM tokens (~GPT-4). | `3200` |
| `images` | Array of image objects with `url`, `width`, `height`, `alt`. | `[{...}]` |
| `is_premium` | Paywall status. | `false` |
| `metadata` | Engagement statistics (platform-dependent). | `{"likes": 245}` |

Results are accessible via the Apify platform's Dataset tab in clean JSON format.
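
Once the dataset is downloaded as JSON, the `word_count`, `token_count`, and `is_premium` fields make it straightforward to size a batch before sending it to a language model. The following is a minimal sketch under stated assumptions: `summarize_posts` is a hypothetical helper, the price per million tokens is a value you supply yourself, and skipping paywalled posts is a design choice made here because their content may be incomplete.

```python
def summarize_posts(posts, usd_per_million_tokens):
    """Total words and tokens for non-premium posts, with a rough cost estimate."""
    free_posts = [p for p in posts if not p.get("is_premium", False)]
    total_words = sum(p.get("word_count", 0) for p in free_posts)
    total_tokens = sum(p.get("token_count", 0) for p in free_posts)
    return {
        "posts": len(free_posts),
        "words": total_words,
        "tokens": total_tokens,
        "estimated_cost_usd": total_tokens / 1_000_000 * usd_per_million_tokens,
    }


if __name__ == "__main__":
    # Made-up records that follow the field names in the table above.
    sample = [
        {"title": "10 Product Lessons", "word_count": 2450,
         "token_count": 3200, "is_premium": False},
        {"title": "Members Only", "word_count": 400,
         "token_count": 500, "is_premium": True},
    ]
    print(summarize_posts(sample, usd_per_million_tokens=2.0))
```

Because `token_count` is itself an estimate, treat the resulting cost as a ballpark figure rather than a quote.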

Actor Information

Developer: benthepythondev
Pricing: Paid
Total Runs: 117
Active Users: 9
