GithubScraper

Name: GithubScraper
Author: fornace


Automatically scrapes and downloads Markdown documentation from GitHub repositories, for easy AI finetuning.

736 runs

8 users



What does this actor do?

GithubScraper is an actor on the Apify platform that crawls a GitHub repository and downloads its Markdown documentation (`.md` and `.mdx` files), for example to build a dataset for AI finetuning. It runs in the cloud, so no local setup is needed.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
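
The same run can also be started through the Apify API mentioned under Key Features. The following is a minimal sketch, assuming the public Apify API v2 "run actor" endpoint. The actor ID `fornace~githubscraper` and the token placeholder are assumptions, not values confirmed by this page; check the actor's API tab on Apify for the real ID. The request is only built, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an actor run."""
    url = f"{API_BASE}/acts/{actor_id}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),  # actor input goes in the JSON body
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "fornace~githubscraper",   # hypothetical actor ID, used for illustration only
    "<YOUR_APIFY_TOKEN>",      # placeholder, replace with a real API token
    {
        "startUrl": "https://github.com/apify/apify-docs/tree/master",
        "globPattern": "**/*.mdx",
    },
)

# Calling urllib.request.urlopen(request) would start the run;
# it is left out here so the sketch has no side effects.
print(request.get_method(), request.full_url)
```

Once a run finishes, its results are in the run's default dataset, which can be downloaded from the Apify console as in the last step above.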

Documentation

GitHub Markdown Documentation Downloader

This actor aggregates `.md` and `.mdx` files containing Markdown documentation from specified GitHub repositories. It navigates through the repository's file structure and downloads the files, which are useful for training or finetuning models.

### Features

- Downloads `.md` and `.mdx` files from GitHub repositories.
- Uses KeyValueStore to maintain coherence across concurrent executions.
- Keeps the documentation coherent by not downloading from commits or other branches.

### Usage

Set `startUrl` to the root of the docs folder in the GitHub repository and run the actor.

### Input Parameters

- `startUrl`: The starting URL of the GitHub repository's documentation directory.
- `globPattern`: Glob pattern to match files within the repository. Defaults to `**/*.{md,mdx}`.
- `maxConcurrency`: The maximum number of requests processed concurrently. Default is 1000.
- `maxRequestsPerMinute`: The maximum number of requests made per minute. Default is 600.
- `minConcurrency`: The minimum number of concurrent requests during execution. Default is 5.
- `desiredConcurrency`: The initial target number of concurrent requests. Default is 15.

### Output

The actor writes each Markdown file to the default dataset. Each entry contains the file name and content.

### Example Input

```json
{
  "startUrl": "https://github.com/apify/apify-docs/tree/master",
  "globPattern": "**/*.mdx",
  "crawlerOptions": {
    "maxConcurrency": 10
  }
}
```

### Support

For support, contact info@fornace.it.
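
To make the `globPattern` parameter more concrete, here is a rough sketch of how a pattern such as `**/*.mdx` (from the Example Input) or a brace form like `**/*.{md,mdx}` selects repository paths. It is an illustration under common glob conventions (`**/` spans any number of directories, `*` stays within one path segment, `{a,b}` lists alternatives), not the actor's actual matching code, which may differ in edge cases. The helper names `expand_braces`, `glob_to_regex` and `matches` are made up for this example.

```python
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} groups: '*.{md,mdx}' -> ['*.md', '*.mdx']."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a brace-free glob into an anchored regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")   # zero or more leading directories
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")      # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def matches(path: str, pattern: str) -> bool:
    """Return True if the path matches any brace-expanded form of the glob."""
    return any(glob_to_regex(p).match(path) for p in expand_braces(pattern))


paths = ["README.md", "docs/intro.md", "docs/api/client.mdx", "src/index.ts"]
selected = [p for p in paths if matches(p, "**/*.{md,mdx}")]
print(selected)
```

With the narrower `**/*.mdx` pattern, only `docs/api/client.mdx` would be selected from the same list.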



Actor Information

Developer: fornace
Pricing: Paid
Total Runs: 736
Active Users: 8
