Google Scholar Scraper: Articles, Citations & PDFs

by primeparse

About Google Scholar Scraper: Articles, Citations & PDFs

Extract academic data from Google Scholar: titles, authors, years, citations, abstracts, PDF links. Supports queries, year filters (1900-2100), pagination (up to 5 pages). Rate-limited for safety. Ideal for research, citations, datasets, AI. Clean JSON output. Run on Apify with proxies.

What does this actor do?

Google Scholar Scraper: Articles, Citations & PDFs is a web scraping actor on the Apify platform. It runs in the cloud and extracts structured metadata (titles, authors, years, citation counts, abstracts, and PDF links) from Google Scholar search results.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results
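
For step 3, it can help to assemble and check the input before pasting it into the Apify console. The sketch below is a minimal helper, not part of the actor: `build_input` is a hypothetical name, and the field names (`query`, `maxPages`, `startYear`, `endYear`, `proxyConfiguration`) are taken from the actor's documentation. It enforces only the limits stated there. Treating 5 pages as a hard cap and requiring `startYear <= endYear` are assumptions.

```python
import json

YEAR_MIN, YEAR_MAX = 1900, 2100  # year range stated in the actor description
MAX_PAGES = 5                    # documented pagination limit (assumed to be a hard cap)


def build_input(query, max_pages=1, start_year=None, end_year=None,
                use_apify_proxy=True):
    """Return an input dict that uses the field names from the actor's docs."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is required and must be a non-empty string")
    if not isinstance(max_pages, int) or not 1 <= max_pages <= MAX_PAGES:
        raise ValueError(f"maxPages must be an integer from 1 to {MAX_PAGES}")
    for name, year in (("startYear", start_year), ("endYear", end_year)):
        if year is not None and not YEAR_MIN <= year <= YEAR_MAX:
            raise ValueError(f"{name} must be between {YEAR_MIN} and {YEAR_MAX}")
    # Assumption: an inverted range is a mistake rather than a valid filter.
    if start_year is not None and end_year is not None and start_year > end_year:
        raise ValueError("startYear must not be greater than endYear")

    actor_input = {"query": query.strip(), "maxPages": max_pages}
    if start_year is not None:
        actor_input["startYear"] = start_year
    if end_year is not None:
        actor_input["endYear"] = end_year
    actor_input["proxyConfiguration"] = {"useApifyProxy": use_apify_proxy}
    return actor_input


if __name__ == "__main__":
    # Print JSON that can be pasted into the actor's input editor.
    print(json.dumps(build_input("machine learning", start_year=2020,
                                 end_year=2026), indent=2))
```

Optional fields are left out when unset rather than sent as `null`, since the documentation describes the year filters as optional and does not say how the actor handles explicit nulls.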

Documentation

# 🔬 Google Scholar Scraper: Academic Research Data Extractor

Enterprise-grade Google Scholar scraper for academic research and data analysis. Collects structured data from Google Scholar search results including titles, authors, citations, abstracts, and PDF links. Ideal for literature reviews, citation analysis, and academic dataset building. Features intelligent parsing, rate limiting, and year filtering.

## High-quality Google Scholar Data Extractor for Researchers, Academics, and Data Scientists

Automatically searches Google Scholar, extracts article metadata, filters by publication year, and collects citation data: clean, structured, and ready for analysis or academic research.

Built for:

- Academic researchers conducting literature reviews
- Data scientists building research datasets
- PhD students tracking citations and publications
- Librarians organizing academic resources
- Research teams monitoring publication trends
- AI/ML engineers collecting training data from academic sources

Highlights:

- ✅ Smart search with keyword queries
- ✅ Year range filtering (1900-2100)
- ✅ Rich metadata extraction (title, authors, year, citations, abstract, PDF links)
- ✅ Automatic pagination support (up to 5 pages)
- ✅ Rate limiting & respectful crawling
- ✅ AI-ready structured output

👉 Runs on Apify • No code required

## 🚀 Why This Scraper

### ✔ Purpose-Built for Academic Research

Intelligently extracts structured data from Google Scholar search results, which suits literature reviews, citation analysis, and academic research.

### ✔ Comprehensive Metadata Extraction

Extracts all essential academic metadata: article titles, author lists, publication years, citation counts, abstracts, PDF links, and result URLs.

### ✔ Clean & Structured Output

Produces clean, structured JSON output ready for analysis, database import, or further processing. Well suited to academic datasets and research workflows.

### ✔ Smart Year Filtering

Filter results by publication year range to focus on recent research or historical publications. Supports years from 1900 to 2100.

### ✔ AI & ML Ready

Structured JSON output suitable for RAG systems, LLM fine-tuning, academic knowledge bases, or training datasets for research applications.

### ✔ Fast & Efficient

Powered by Puppeteer for reliable browser automation. Handles dynamic content and JavaScript-rendered pages efficiently.

### ✔ Safe & Controlled Processing

Built-in rate limiting (1-2 second delays), configurable pagination limits, and graceful error handling to respect Google Scholar's infrastructure.

## 💼 Use Cases

- Literature reviews: Collect and analyze academic papers for systematic reviews
- Citation tracking: Monitor citation counts and track research impact
- Publication monitoring: Track new publications in specific research areas
- Dataset building: Create structured datasets for academic research or AI training
- Competitive research: Monitor competitor publications and research trends
- Academic analysis: Analyze publication patterns, author networks, and citation trends
- PDF collection: Automatically collect PDF links for offline research

## 📊 Supported Data

- Article titles: Full publication titles
- Authors: Complete author lists (up to 10 authors per article)
- Publication years: Extracted from metadata
- Citation counts: Number of citations for each article
- Abstracts: Article abstracts when available
- PDF links: Direct links to PDF files when available
- Result links: Links to the article page behind each Google Scholar result

## ⚙️ How It Works

1. Enter your search query (e.g., "machine learning", "quantum computing")
2. Optionally set year range filters and pagination limits
3. Configure proxy settings for reliable access
4. Run the Actor
5. Download clean, structured academic datasets

## 🧩 Input Configuration

### Example JSON Input

```json
{
  "query": "machine learning",
  "maxPages": 1,
  "startYear": 2020,
  "endYear": 2026,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

### Key Options

- `query`: Search query string (required, e.g., "machine learning", "neural networks")
- `maxPages`: Maximum number of result pages to scrape (default: 1, recommended: 1-5)
- `startYear`: Filter results by minimum publication year (optional, 1900-2100)
- `endYear`: Filter results by maximum publication year (optional, 1900-2100)
- `proxyConfiguration`: Proxy settings for anti-bot protection (default: uses Apify Proxy)

### Search Query Tips

- Use specific terms for better results (e.g., "deep learning neural networks" instead of "AI")
- Combine keywords with quotes for exact phrases: `"transfer learning"`
- Use Boolean operators: `machine learning AND computer vision`
- Filter by author: `author:"John Smith" machine learning`
- Filter by publication: `source:"Nature" quantum computing`

## 📂 Output Dataset

All articles are stored in the default Apify dataset with the following structure:

### Example Output Record

```json
{
  "title": "Machine learning",
  "authors": [
    "ZH Zhou"
  ],
  "year": 2021,
  "citations": 3301,
  "abstract": "… from data is called learning or training. The … machine learning is to find or approximate ground-truth. In this book, models are sometimes called learners, which are machine learning …",
  "pdfLink": null,
  "scholarLink": "https://books.google.com/books?hl=en&lr=&id=ctM-EAAAQBAJ&oi=fnd&pg=PR6&dq=machine+learning&ots=o_OnT7Rv3p&sig=bH9TGnw_ZdZYH4lSLmKun7xX6Cs"
}
```

### Output Fields

- `title` (string, required): Article title
- `authors` (array, required): List of author names (up to 10 authors)
- `year` (integer|null): Publication year
- `citations` (integer|null): Number of citations
- `abstract` (string|null): Article abstract when available
- `pdfLink` (string|null): Direct link to PDF file when available
- `scholarLink` (string, required): Link to the article page. In the example records this is the URL the search result points to (a Google Books or publisher page), not a scholar.google.com address.

### Multiple Authors Example

```json
{
  "title": "A guide to machine learning for biologists",
  "authors": [
    "JG Greener",
    "SM Kandathil"
  ],
  "year": 2022,
  "citations": 2020,
  "abstract": "… A machine learning task is an objective specification for what we want a machine learning model to accomplish…",
  "pdfLink": "https://discovery.ucl.ac.uk/id/eprint/10134478/1/NRMCB-review-accepted-forRPS.pdf",
  "scholarLink": "https://www.nature.com/articles/s41580-021-00407-0"
}
```

### Input File Example

Create `storage/key_value_stores/default/INPUT.json`:

```json
{
  "query": "quantum computing",
  "maxPages": 2,
  "startYear": 2020,
  "endYear": 2024
}
```

## 📈 Performance

- Processing Speed: ~3-4 seconds per page (depending on results)
- Rate Limiting: Built-in 1-2 second delays between requests
- Concurrency: Single request at a time for reliability
- Scalability: Handles 1-5 pages optimally (up to 50 articles per run)
- Success Rate: High reliability with proper proxy configuration

## 🔧 Advanced Configuration

### Year Range Filtering

Filter results by publication year:

```json
{
  "query": "artificial intelligence",
  "startYear": 2020,
  "endYear": 2026
}
```

### Multiple Pages

Scrape multiple pages for comprehensive results:

```json
{
  "query": "deep learning",
  "maxPages": 5
}
```

### Proxy Configuration

Use Apify Proxy for reliable access:

```json
{
  "query": "neural networks",
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

## 📧 Support

- Issues: Use the Apify Issues tab for bug reports
- Documentation: Check the Apify documentation for platform features
- Community: Join the Apify community for discussions

Tags: Google Scholar, academic research, literature review, citation analysis, research data, paper scraping, academic scraping, research automation, citation tracking, publication monitoring, academic dataset, research tools, scholarly articles, PDF extraction

---

Built with ❤️ on Apify
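
Once a run has finished and the dataset has been exported as JSON, the records can be post-processed locally. The following is a minimal sketch, assuming records shaped like the example output in the documentation, where `citations`, `year`, and `pdfLink` may be `null`. The function names are illustrative and not part of the actor, and the sample data is abbreviated from the two documented example records.

```python
import json

# Two records abbreviated from the documented example output
# (abstract and scholarLink omitted for brevity).
SAMPLE_JSON = """
[
  {"title": "Machine learning", "authors": ["ZH Zhou"], "year": 2021,
   "citations": 3301, "pdfLink": null},
  {"title": "A guide to machine learning for biologists",
   "authors": ["JG Greener", "SM Kandathil"], "year": 2022, "citations": 2020,
   "pdfLink": "https://discovery.ucl.ac.uk/id/eprint/10134478/1/NRMCB-review-accepted-forRPS.pdf"}
]
"""


def top_cited(records, limit=10):
    """Return up to `limit` records, most cited first; null counts sort as 0."""
    return sorted(records, key=lambda r: r.get("citations") or 0,
                  reverse=True)[:limit]


def with_pdf(records):
    """Keep only records that carry a non-null, non-empty pdfLink."""
    return [r for r in records if r.get("pdfLink")]


def published_between(records, start_year, end_year):
    """Keep records whose year is known and inside the inclusive range."""
    return [r for r in records
            if r.get("year") is not None
            and start_year <= r["year"] <= end_year]


records = json.loads(SAMPLE_JSON)
for rec in top_cited(records):
    print(rec["citations"], rec["title"])
print([r["title"] for r in with_pdf(records)])
```

Handling `null` explicitly matters here because the documented schema marks `year`, `citations`, `abstract`, and `pdfLink` as nullable; sorting on a raw `None` would raise a `TypeError` in Python 3.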

Ready to Get Started?

Try Google Scholar Scraper: Articles, Citations & PDFs now on Apify. Free tier available with no credit card required.

Actor Information

Developer: primeparse
Pricing: Paid
Total Runs: 2
Active Users: 2

Apify Platform

Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
