Newsletter Scraper

by benthepythondev

Scrape newsletter archives from Substack, Beehiiv, and Ghost. Get clean markdown, metadata, images, and AI-ready token counts for content research and model training.

117 runs
9 users

About Newsletter Scraper

Need to get your hands on the raw content from top newsletter platforms? This actor scrapes newsletter archives from Substack, Beehiiv, and Ghost, pulling everything you need for serious data work. It grabs the full text and delivers it in clean markdown, which is perfect for feeding into other systems. You also get all the metadata, any images embedded in the posts, and practical metrics like word and token counts. I use the token counts specifically to gauge costs before running batches through language models. It’s become a go-to for my own content research, letting me analyze trends and see what competitors are publishing without manual copying. For anyone building or training an AI model, having this structured, clean output is a huge time-saver over building a scraper from scratch. Just point it at the newsletter URL you need, and it handles the extraction, organizing the content in a way that’s immediately useful.

What does this actor do?

Newsletter Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

Newsletter Scraper

Extracts complete archives from Substack, Beehiiv, and Ghost newsletters. Outputs clean, structured data including post content, metadata, and engagement metrics, formatted for AI training or content analysis.

Overview

This actor scrapes newsletter platforms to provide full post histories. It outputs content in multiple formats (Markdown, HTML, plain text) alongside calculated metrics like word and token counts. It's designed for use cases like building LLM training datasets, content research, or competitive analysis.

Key Features

  • Platform Support: Works with Substack, Beehiiv, and Ghost newsletters.
  • Complete Data Extraction:
    • Full post content (title, body, author, publication date).
    • Newsletter metadata (name, description, author info).
    • Engagement metrics (likes, comments) where available.
    • Image data (URLs, alt text, dimensions).
    • Paywall/premium content detection.
  • LLM-Optimized Output: Provides clean Markdown and estimates token counts (GPT-style).
  • Flexible Scraping Modes: Choose to scrape the full archive, a single post, or just newsletter info.
  • Rate-Limiting: Configurable delay between requests to respect target sites.
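
The token counts mentioned above are estimates, and the actor's exact method is not documented here. As a rough illustration only, the sketch below uses a common rule of thumb for English text with GPT-style tokenizers (about 4 characters per token). The function name `estimate_tokens` and the `chars_per_token` parameter are hypothetical and are not part of the actor.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character length.

    Assumption: about 4 characters per token, a rule of thumb for English
    text. This is not the actor's documented method; use a real tokenizer
    when exact counts matter.
    """
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)
```

An estimate like this is good enough for budgeting a batch, but it can drift noticeably for code-heavy or non-English posts.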

How to Use

Input Configuration

  1. Newsletter URL: Provide the homepage URL of the target newsletter (e.g., https://newsletter.substack.com).
  2. Scrape Mode:
    • FULL_ARCHIVE: Gets the entire post history.
    • SINGLE_POST: Extracts one specific post, given that post's URL.
    • NEWSLETTER_INFO: Fetches only metadata (name, description, author).
  3. Output Format: Choose MARKDOWN (recommended for AI), HTML, PLAIN_TEXT, or ALL.
  4. Optional Settings:
    • maxPosts: Limit the number of posts scraped (0 for no limit).
    • includeImages: Toggle image data extraction.
    • includeMetadata: Toggle engagement metrics.
    • postsSinceDate: Scrape only posts published after this date.
    • delayBetweenRequests: Set a pause (in seconds) between page requests.
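
The options above can be assembled and sanity-checked before a run. The sketch below is a hypothetical helper (`build_input` is not part of the actor). The key names and the allowed mode and format values come from this documentation; the default values and the `YYYY-MM-DD` date format for `postsSinceDate` are assumptions.

```python
from datetime import date

SCRAPE_MODES = {"FULL_ARCHIVE", "SINGLE_POST", "NEWSLETTER_INFO"}
OUTPUT_FORMATS = {"MARKDOWN", "HTML", "PLAIN_TEXT", "ALL"}


def build_input(newsletter_url, scrape_mode="FULL_ARCHIVE",
                output_format="MARKDOWN", max_posts=0,
                include_images=True, include_metadata=True,
                posts_since=None, delay_seconds=None):
    """Build an input dict using the documented option names.

    The defaults here are assumptions, not the actor's documented defaults.
    """
    if scrape_mode not in SCRAPE_MODES:
        raise ValueError(f"unknown scrapeMode: {scrape_mode!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {output_format!r}")
    if max_posts < 0:
        raise ValueError("maxPosts must be 0 (no limit) or a positive number")

    run_input = {
        "newsletterUrl": newsletter_url,
        "scrapeMode": scrape_mode,
        "outputFormat": output_format,
        "maxPosts": max_posts,
        "includeImages": include_images,
        "includeMetadata": include_metadata,
    }
    if posts_since is not None:
        # Assumption: the actor accepts an ISO date string (YYYY-MM-DD).
        run_input["postsSinceDate"] = posts_since.isoformat()
    if delay_seconds is not None:
        # Documented as a pause in seconds between page requests.
        run_input["delayBetweenRequests"] = delay_seconds
    return run_input
```

Optional keys are left out unless set, so the actor's own defaults apply to anything not specified.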

Execution

Run the actor with your configured input. It will:
  1. Validate the provided URL.
  2. Extract post links from the archive.
  3. Scrape content and metadata from each post.
  4. Process data into the chosen output format(s).
  5. Save the results to the dataset.
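
The actor's internal validation logic is not documented, so the following is only a sketch of what a check like step 1 might involve. The function `check_newsletter_url` and the hostname-suffix table are hypothetical. Ghost sites and newsletters on custom domains cannot be recognized from the hostname alone, so they are reported as "unknown" here.

```python
from urllib.parse import urlparse

# Hostname suffixes for hosted newsletters (an assumption for illustration).
HOSTED_SUFFIXES = {".substack.com": "substack", ".beehiiv.com": "beehiiv"}


def check_newsletter_url(url):
    """Reject malformed URLs and guess the platform from the hostname."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"not a valid http(s) URL: {url!r}")
    host = parsed.hostname.lower()
    for suffix, platform in HOSTED_SUFFIXES.items():
        if host.endswith(suffix):
            return platform
    # Custom domains and self-hosted Ghost sites end up here.
    return "unknown"
```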

Input/Output

Input Example (JSON):

{
  "newsletterUrl": "https://example.beehiiv.com",
  "scrapeMode": "FULL_ARCHIVE",
  "outputFormat": "MARKDOWN",
  "maxPosts": 50,
  "includeImages": true
}
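
The same input can also be submitted programmatically. The sketch below assumes the official `apify-client` Python package and an actor ID of `benthepythondev/newsletter-scraper`. The ID is a guess based on the developer and actor names, so check the actor's page for the real one. The client is passed in as an argument so the function can be exercised without network access.

```python
# Assumption: the real actor ID may differ; verify it on the actor's page.
ACTOR_ID = "benthepythondev/newsletter-scraper"


def fetch_posts(client, run_input, actor_id=ACTOR_ID):
    """Start the actor, wait for the run to finish, and return dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if not run:
        raise RuntimeError("actor run did not return run details")
    # The run details include the ID of the dataset holding the results.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and an Apify API token):
#   from apify_client import ApifyClient
#   posts = fetch_posts(
#       ApifyClient("<YOUR_API_TOKEN>"),
#       {"newsletterUrl": "https://example.beehiiv.com",
#        "scrapeMode": "FULL_ARCHIVE",
#        "outputFormat": "MARKDOWN",
#        "maxPosts": 50,
#        "includeImages": True},
#   )
```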

Output Structure:
The dataset contains an array of post objects. Each object includes:

Field | Description | Example
--- | --- | ---
title | Post title. | "10 Product Lessons"
url | Direct link to the post. | https://newsletter.com/p/post-slug
author | Post author name. | "Jane Doe"
published_date | ISO 8601 publication timestamp. | 2025-01-15T10:00:00Z
content_markdown | Main content in LLM-ready Markdown. | "# Header\n\nArticle text..."
content_html | Original HTML content. | "<article>...</article>"
content_text | Plain text version. | "Article text..."
word_count | Total word count. | 2450
token_count | Estimated LLM tokens (~GPT-4). | 3200
images | Array of image objects with url, width, height, alt. | [{...}]
is_premium | Paywall status. | false
metadata | Engagement statistics (platform-dependent). | {"likes": 245}

Results are accessible via the Apify platform's Dataset tab in clean JSON format.
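
Once the dataset is exported as JSON, the per-post counts can be totalled to budget a batch before sending it to a language model. The sketch below is illustrative: `summarize`, the `usd_per_1k_tokens` price, and the choice to skip premium posts (on the assumption that paywalled posts may contain only a preview) are hypothetical. The field names follow the output structure described above.

```python
def summarize(posts, usd_per_1k_tokens=0.0, skip_premium=True):
    """Total the word and token counts across scraped posts.

    `usd_per_1k_tokens` is a placeholder price; substitute the rate of
    whichever model you plan to use.
    """
    if skip_premium:
        # Assumption: paywalled posts may only include a preview.
        posts = [p for p in posts if not p.get("is_premium", False)]
    total_words = sum(p.get("word_count", 0) for p in posts)
    total_tokens = sum(p.get("token_count", 0) for p in posts)
    return {
        "posts": len(posts),
        "words": total_words,
        "tokens": total_tokens,
        "estimated_cost_usd": round(total_tokens / 1000 * usd_per_1k_tokens, 4),
    }
```

Because `token_count` is itself an estimate, treat the resulting cost as a ballpark figure.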

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Ready to Get Started?

Try Newsletter Scraper now on Apify. Free tier available with no credit card required.

Actor Information

Developer: benthepythondev
Pricing: Paid
Total Runs: 117
Active Users: 9

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
