arXiv Pro Scraper - API & Full Text

Author: exuberant_promotion


40 runs

3 users


About arXiv Pro Scraper - API & Full Text

A professional, low-cost arXiv scraper that uses the official API to find papers, then downloads, cleans, and chunks the full PDF text—creating AI-ready datasets in one click.
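As a rough illustration of the first step (finding papers through the official arXiv API), the sketch below builds a query URL for the public `export.arxiv.org/api/query` endpoint. The function name `build_query_url` and the way the keyword and category are combined are assumptions made for this example; the actor's real query construction is not documented here. The page size of 50 follows the figure given in the documentation for `maxPages`.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
RESULTS_PER_PAGE = 50  # page size stated in the actor's input documentation


def build_query_url(search_query=None, category=None, page=0):
    """Build an arXiv API URL for one page of results (hypothetical helper)."""
    terms = []
    if search_query:
        # Double quotes ask the arXiv API to treat the words as a phrase.
        terms.append(f'all:"{search_query}"')
    if category:
        terms.append(f"cat:{category}")
    if not terms:
        raise ValueError("Provide search_query, category, or both")

    params = {
        "search_query": " AND ".join(terms),
        "start": page * RESULTS_PER_PAGE,  # zero-based offset of the first result
        "max_results": RESULTS_PER_PAGE,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url(search_query="quantum computing", category="cs.CL"))
```

The response to such a request is an Atom feed containing titles, authors, abstracts, and links, which is where the metadata fields in the output would plausibly come from.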

What does this actor do?

arXiv Pro Scraper - API & Full Text is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results
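The same steps can also be done programmatically through the Apify API instead of the web console. The sketch below only prepares the HTTP request for Apify's `run-sync-get-dataset-items` endpoint using the Python standard library; it does not send it. The actor ID (`your-username~your-actor-name`) and the token are placeholders, not the real identifiers for this actor, and `build_run_request` is a hypothetical helper name.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Prepare (but do not send) a request that runs an actor and returns its dataset items."""
    url = f"{APIFY_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),  # the actor input is the JSON body
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder actor ID and token: replace both with your own values.
    request = build_run_request(
        "your-username~your-actor-name",
        "YOUR_APIFY_TOKEN",
        {"searchQuery": "quantum computing", "maxPages": 3},
    )
    print(request.get_method(), request.full_url)
    # To actually start the run:
    # with urllib.request.urlopen(request) as response:
    #     items = json.load(response)
```

Sending the request starts a run, so it is left commented out here.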

Documentation

# arXiv API & Full-Text Scraper (AI-Ready)

This Apify Actor provides a complete, AI-ready dataset from arXiv.org. It uses the official arXiv API for fast and reliable metadata scraping, then downloads, cleans, and chunks the full text from each paper's PDF.

This tool is designed for AI/LLM developers, researchers, and data scientists who need high-quality text corpora for model training and Retrieval-Augmented Generation (RAG) pipelines.

## Key Competitive Features

* ⚡️ Blazing Fast & Low Cost: Uses the official arXiv API, not a slow browser, to find papers. This is thousands of times faster and cheaper than other scrapers.
* 🤖 AI-Ready Chunking: Automatically splits the clean text into overlapping chunks, perfect for ingestion into vector databases (RAG).
* 🧹 Automatic Text Cleaning: Cleans the raw PDF text to remove headers, footers, page numbers, and bibliographies.
* 🎯 Powerful Search: Scrape by keyword, category code (e.g., `cs.AI`), or both.

---

## Input

The actor accepts the following JSON input. At least one of `searchQuery` or `category` is required.

| Field | Type | Description | Default |
| :--- | :--- | :--- | :--- |
| `searchQuery` | String | (Optional) The keyword search query (e.g., "black hole physics"). | |
| `category` | String | (Optional) The arXiv category code (e.g., `cs.AI` for AI, `gr-qc` for General Relativity). | |
| `maxPages` | Number | The maximum number of result pages to scrape (50 results per page). | `1` |
| `chunkSize` | Number | (Optional) The target size (in characters) for each text chunk. | `1000` |
| `chunkOverlap` | Number | (Optional) The number of characters to overlap between chunks. | `200` |

### Example Input (Simple Keyword Search)

```json
{
  "searchQuery": "quantum computing",
  "maxPages": 3
}
```

### Example Input (AI-Ready Category Search)

```json
{
  "category": "cs.CL",
  "maxPages": 10,
  "chunkSize": 1500,
  "chunkOverlap": 300
}
```

### Example Output

```json
{
  "title": "Quantum black holes: inside and outside",
  "authors": "Bernard S. Kay",
  "abstract": "We review and add to a conjectured 'principle' according to which...",
  "url": "http://arxiv.org/abs/2510.20799v1",
  "cleanedFullText": "Quantum black holes: inside and outside\n\nBernard S. Kay\n\n(Dated: October 23, 2025)\n\nAbstract\nWe review and add to a conjectured 'principle' according to which...\n\n[... rest of the cleaned paper text ...]\n\n",
  "textChunks": [
    "Quantum black holes: inside and outside\n\nBernard S. Kay\n\n(Dated: October 23, 2025)\n\nAbstract\nWe review and add to a conjectured 'principle' according to which a necessary condition for a ... [900 more characters]",
    "necessary condition for a ... [800 more characters] ... The principle is intended to apply only to 'civilized' spacetimes. We shall recall a ... [and 200 more characters]",
    "[... etc ...]"
  ]
}
```
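To make the effect of `chunkSize` and `chunkOverlap` concrete, here is a minimal sketch of overlapping character-based chunking. It is an assumption about how such chunking can work, not the actor's actual implementation: the documentation calls `chunkSize` a "target" size, so the real splitter may adjust boundaries (for example at sentence or paragraph breaks), which this simple sliding window does not do. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "x" * 2500
    pieces = chunk_text(sample)  # defaults match the documented 1000 / 200
    print(len(pieces), [len(p) for p in pieces])
```

With the defaults, each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.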



Actor Information

Developer: exuberant_promotion
Pricing: Paid
Total Runs: 40
Active Users: 3

