Tech Docs to LLM-Ready Markdown

by hedelka


About Tech Docs to LLM-Ready Markdown

Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

What does this actor do?

Tech Docs to LLM-Ready Markdown is a scraping tool that runs on the Apify platform. It crawls technical documentation sites, detects the underlying framework (Docusaurus, GitBook, MkDocs Material, ReadTheDocs, and others), strips navigation and page chrome, and outputs clean Markdown enriched with RAG-oriented metadata such as doc_id, section_path, and optional chunking.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

# Tech Docs to LLM-Ready Markdown Scraper 🚀

Convert any technical documentation site to clean, structured Markdown — ready for RAG pipelines, LLM training, and AI assistants.

## Why This Actor?

While generic web scrapers dump raw HTML, this Actor is specifically designed for technical documentation:

| Feature | Generic Scrapers | This Actor |
|---------|------------------|------------|
| Code block preservation | ❌ Lost or broken | ✅ With language tags |
| Framework-aware extraction | ❌ One-size-fits-all | ✅ Docusaurus, GitBook, MkDocs |
| Navigation removal | ❌ Mixed with content | ✅ Clean content only |
| RAG-ready output | ❌ Needs post-processing | ✅ doc_id, section_path, chunking |

### 🔄 Before / After
**❌ Generic Scraper Output** (messy HTML noise)

```
Skip to main content | Docs | Community | Blog | GitHub | Search docs...
Introduction | Quick Start | Guides
← Previous | Next → | Edit this page
Introduction Crawlee covers your crawling...
Last updated 2 days ago | Was this page helpful? Yes No
```

**✅ This Actor Output** (clean, structured Markdown)

````markdown
# Introduction

Crawlee covers your crawling and scraping end-to-end and helps you **build reliable scrapers. Fast.**

## 🛠 Features

- Single interface for **HTTP and headless browser** crawling
- Persistent **queue** for URLs to crawl
- Automatic **scaling** with available system resources

```javascript
import { PlaywrightCrawler } from 'crawlee';
```
````
📚 More real examples: Docusaurus, MkDocs, ReadTheDocs.

## 🎯 RAG-First Output

Every result includes fields optimized for vector databases and LLM loaders:

```json
{
  "doc_id": "acdb145c14f4310b",
  "url": "https://crawlee.dev/docs/introduction",
  "title": "Introduction | Crawlee",
  "section_path": "Guides > Quick Start > Introduction",
  "content": "# Introduction\n\nCrawlee covers your crawling...",
  "framework": "docusaurus",
  "chunk_index": 0,
  "total_chunks": 1,
  "metadata": {
    "crawledAt": "2025-12-12T03:34:46.151Z",
    "depth": 0,
    "wordCount": 358,
    "charCount": 2475
  }
}
```

## Supported Documentation Frameworks

| Framework | Status | Example |
|-----------|--------|---------|
| Docusaurus | ✅ Verified | React, Crawlee, Playwright docs |
| GitBook | ✅ Verified | Many SaaS products |
| MkDocs Material | ✅ Verified | Python projects |
| ReadTheDocs | ✅ Verified | Sphinx documentation |
| VuePress | ✅ Supported | Vue.js ecosystem |
| Nextra | ✅ Supported | Next.js docs |
| Generic | ✅ Fallback | Any HTML docs |

## Input Example

```json
{
  "startUrls": [{"url": "https://crawlee.dev/docs/introduction"}],
  "maxPages": 100,
  "maxDepth": 10,
  "enableChunking": true,
  "chunkSize": 2000,
  "outputFormat": "markdown"
}
```

## 🔗 LangChain Integration (Python)

```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"],
        },
    ),
)
docs = loader.load()  # Ready for vectorstore!
```
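When `enableChunking` splits long pages, the `doc_id`, `chunk_index`, and `total_chunks` fields let you stitch chunks back into whole documents before (or instead of) embedding. A minimal sketch, assuming dataset items shaped like the output example above:

```python
from collections import defaultdict


def reassemble_docs(items):
    """Group chunked dataset items into full documents, keyed by doc_id."""
    by_doc = defaultdict(list)
    for item in items:
        by_doc[item["doc_id"]].append(item)
    docs = {}
    for doc_id, chunks in by_doc.items():
        chunks.sort(key=lambda c: c["chunk_index"])  # restore original order
        docs[doc_id] = "\n\n".join(c["content"] for c in chunks)
    return docs
```

This is a sketch, not part of the Actor itself; adapt the join separator and field handling to your pipeline.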
Then index the loaded documents:

```python
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(docs, embeddings)
```

## 🦙 LlamaIndex Integration

```python
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("hedelka/tech-docs-scraper")
documents = reader.load_data(
    run_input={"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50}
)

# Build index directly
index = VectorStoreIndex.from_documents(documents)
```

## 📡 API Call

```bash
curl -X POST "https://api.apify.com/v2/acts/hedelka~tech-docs-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50}'
```

## Use Cases

1. RAG Pipelines: Feed documentation to LangChain/LlamaIndex for "Chat with Docs"
2. LLM Fine-tuning: Create high-quality datasets from official docs
3. Knowledge Bases: Build searchable documentation archives
4. AI Assistants: Power coding assistants with up-to-date API references
5. Scheduled Updates: Keep your RAG knowledge base in sync with docs

### 📅 Scheduled Docs Updates

Use Apify Scheduler to automatically re-scrape documentation and update your vector store:

1. Create a Schedule in Apify Console → Schedules
2. Set cron: `0 0 * * 0` (weekly) or `0 0 1 * *` (monthly)
3. Use a Webhook to trigger re-indexing in your RAG pipeline

```json
{
  "startUrls": [{"url": "https://docs.example.com"}],
  "maxPages": 500,
  "preset": "large-docs",
  "exportJsonl": true
}
```

Your vector store always has the latest documentation!

## Pricing

Pay per Result: $0.50 per 1,000 pages

| Pages | Cost |
|-------|------|
| 100 | $0.05 |
| 1,000 | $0.50 |
| 10,000 | $5.00 |

## Author

Built with ❤️ by HEDELKA for the LLM/RAG community. Questions? Issues? Open a GitHub issue or contact on Apify.

Common Use Cases

  • Market Research: gather competitive intelligence and market data
  • Lead Generation: extract contact information for sales outreach
  • Price Monitoring: track competitor pricing and product changes
  • Content Aggregation: collect and organize content from multiple sources

Ready to Get Started?

Try Tech Docs to LLM-Ready Markdown now on Apify. Free tier available with no credit card required.


Actor Information

  • Developer: hedelka
  • Pricing: Paid (pay per result)
  • Total Runs: 41
  • Active Users: 7
Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.

