Tech Docs to LLM-Ready Markdown

by hedelka


About Tech Docs to LLM-Ready Markdown

Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

What does this actor do?

Tech Docs to LLM-Ready Markdown is a scraping tool that runs on the Apify platform. It crawls technical documentation sites, detects the underlying framework (Docusaurus, GitBook, MkDocs Material, ReadTheDocs, and others), strips navigation and page chrome, and outputs clean Markdown enriched with RAG-oriented metadata such as doc_id, section_path, and optional chunking.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

# Tech Docs to LLM-Ready Markdown Scraper 🚀

Convert any technical documentation site to clean, structured Markdown — ready for RAG pipelines, LLM training, and AI assistants.

## Why This Actor?

While generic web scrapers dump raw HTML, this Actor is specifically designed for technical documentation:

| Feature | Generic Scrapers | This Actor |
|---------|------------------|------------|
| Code block preservation | ❌ Lost or broken | ✅ With language tags |
| Framework-aware extraction | ❌ One-size-fits-all | ✅ Docusaurus, GitBook, MkDocs |
| Navigation removal | ❌ Mixed with content | ✅ Clean content only |
| RAG-ready output | ❌ Needs post-processing | ✅ doc_id, section_path, chunking |

### 🔄 Before / After
**❌ Generic Scraper Output** (messy HTML noise)

```
Skip to main content | Docs | Community | Blog | GitHub | Search docs...
Introduction | Quick Start | Guides
← Previous | Next → | Edit this page
Introduction Crawlee covers your crawling...
Last updated 2 days ago | Was this page helpful? Yes No
```

**✅ This Actor Output** (clean, structured Markdown)

````markdown
# Introduction

Crawlee covers your crawling and scraping end-to-end and helps you **build reliable scrapers. Fast.**

## 🛠 Features

- Single interface for **HTTP and headless browser** crawling
- Persistent **queue** for URLs to crawl
- Automatic **scaling** with available system resources

```javascript
import { PlaywrightCrawler } from 'crawlee';
```
````
📚 More real examples: Docusaurus, MkDocs, ReadTheDocs.

## 🎯 RAG-First Output

Every result includes fields optimized for vector databases and LLM loaders:

```json
{
  "doc_id": "acdb145c14f4310b",
  "url": "https://crawlee.dev/docs/introduction",
  "title": "Introduction | Crawlee",
  "section_path": "Guides > Quick Start > Introduction",
  "content": "# Introduction\n\nCrawlee covers your crawling...",
  "framework": "docusaurus",
  "chunk_index": 0,
  "total_chunks": 1,
  "metadata": {
    "crawledAt": "2025-12-12T03:34:46.151Z",
    "depth": 0,
    "wordCount": 358,
    "charCount": 2475
  }
}
```

## Supported Documentation Frameworks

| Framework | Status | Example |
|-----------|--------|---------|
| Docusaurus | ✅ Verified | React, Crawlee, Playwright docs |
| GitBook | ✅ Verified | Many SaaS products |
| MkDocs Material | ✅ Verified | Python projects |
| ReadTheDocs | ✅ Verified | Sphinx documentation |
| VuePress | ✅ Supported | Vue.js ecosystem |
| Nextra | ✅ Supported | Next.js docs |
| Generic | ✅ Fallback | Any HTML docs |

## Input Example

```json
{
  "startUrls": [{"url": "https://crawlee.dev/docs/introduction"}],
  "maxPages": 100,
  "maxDepth": 10,
  "enableChunking": true,
  "chunkSize": 2000,
  "outputFormat": "markdown"
}
```

## 🔗 LangChain Integration (Python)

```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"],
        },
    ),
)
docs = loader.load()  # Ready for vectorstore!
```
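When `enableChunking` splits long pages, the `doc_id`, `chunk_index`, and `total_chunks` fields let you stitch chunks back into whole documents before (or instead of) embedding. A minimal sketch, assuming dataset items shaped like the output example above:

```python
from collections import defaultdict


def reassemble_docs(items):
    """Group chunked dataset items into full documents, keyed by doc_id."""
    by_doc = defaultdict(list)
    for item in items:
        by_doc[item["doc_id"]].append(item)
    docs = {}
    for doc_id, chunks in by_doc.items():
        chunks.sort(key=lambda c: c["chunk_index"])  # restore original order
        docs[doc_id] = "\n\n".join(c["content"] for c in chunks)
    return docs
```

This is a sketch, not part of the Actor itself; adapt the join separator and field handling to your pipeline.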
Then index the loaded documents:

```python
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(docs, embeddings)
```

## 🦙 LlamaIndex Integration

```python
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("hedelka/tech-docs-scraper")
documents = reader.load_data(
    run_input={"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50}
)

# Build index directly
index = VectorStoreIndex.from_documents(documents)
```

## 📡 API Call

```bash
curl -X POST "https://api.apify.com/v2/acts/hedelka~tech-docs-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50}'
```

## Use Cases

1. RAG Pipelines: Feed documentation to LangChain/LlamaIndex for "Chat with Docs"
2. LLM Fine-tuning: Create high-quality datasets from official docs
3. Knowledge Bases: Build searchable documentation archives
4. AI Assistants: Power coding assistants with up-to-date API references
5. Scheduled Updates: Keep your RAG knowledge base in sync with docs

### 📅 Scheduled Docs Updates

Use Apify Scheduler to automatically re-scrape documentation and update your vector store:

1. Create a Schedule in Apify Console → Schedules
2. Set cron: `0 0 * * 0` (weekly) or `0 0 1 * *` (monthly)
3. Use a Webhook to trigger re-indexing in your RAG pipeline

```json
{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 500,
  "preset": "large-docs",
  "exportJsonl": true
}
```

Your vector store always has the latest documentation!

## Pricing

Pay per Result: $0.50 per 1,000 pages

| Pages | Cost |
|-------|------|
| 100 | $0.05 |
| 1,000 | $0.50 |
| 10,000 | $5.00 |

## Author

Built with ❤️ by HEDELKA for the LLM/RAG community. Questions? Issues? Open a GitHub issue or contact on Apify.

Common Use Cases

  • Market Research: gather competitive intelligence and market data
  • Lead Generation: extract contact information for sales outreach
  • Price Monitoring: track competitor pricing and product changes
  • Content Aggregation: collect and organize content from multiple sources

Ready to Get Started?

Try Tech Docs to LLM-Ready Markdown now on Apify. Free tier available with no credit card required.


Actor Information

  • Developer: hedelka
  • Pricing: Paid (pay per result)
  • Total Runs: 41
  • Active Users: 7
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

