Arxiv Citation Network Scraper

by codepoetry

13 runs

2 users

About Arxiv Citation Network Scraper

A professional Apify Actor that scrapes academic papers from arXiv and builds citation networks. Extract paper metadata, analyze author collaborations, track research trends, and discover emerging topics in science and technology.

What does this actor do?

Arxiv Citation Network Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.

Key Features

Cloud-based execution - no local setup required
Scalable infrastructure for large-scale operations
API access for integration with your applications
Built-in proxy rotation and anti-blocking measures
Scheduled runs and webhooks for automation

How to Use

Click "Try This Actor" to open it on Apify
Create a free Apify account if you don't have one
Configure the input parameters as needed
Run the actor and download your results

Documentation

# arXiv Citation Network Scraper – Apify Actor

A production-grade Apify Actor that turns arXiv into a structured, analysis-ready dataset: papers, authors, collaboration networks, and research trends. It is designed to be reliable enough for paying customers building research tools, AI pipelines, and analytics products.

## 🧩 What You Get When You Pay

- High-quality academic data: Clean, structured paper metadata with authors, categories, dates, and direct PDF links.
- Network & trend insights: Built-in author collaboration networks and topic trends so you don’t have to code analytics yourself.
- Stable, monitored actor: Production-oriented implementation with error handling and tests (`test_actor.py`) to keep runs predictable.
- Time savings: No need to learn the arXiv API or parse Atom feeds / HTML yourself.
- Flexible exports: Use the Apify dataset UI to export to JSON, CSV, Excel, or integrate via API.

## What Is an Apify Actor?

An Apify Actor is a serverless micro-app that runs in the Apify cloud. You don’t manage servers or scaling – you just configure input, run the actor, and consume the dataset/API.

## What This Actor Does

This actor provides a complete academic research data pipeline:

1. Discovers papers – Searches arXiv using their official API with flexible filters.
2. Extracts metadata – Titles, abstracts, authors, categories, publication dates, PDF links.
3. Builds networks – Co-authorship and author collaboration structures.
4. Analyzes trends – Top categories, prolific authors, and monthly publication volumes.
5. Delivers insights – All data is pushed into an Apify dataset with a friendly output schema.
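
As an illustration of the first step, the sketch below builds an arXiv API query URL from a keyword and a category. The helper name `build_arxiv_query_url` and its parameters are hypothetical and are not taken from the actor's source; only the endpoint `http://export.arxiv.org/api/query` and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters come from the public arXiv API. The actor's real query logic may differ.

```python
from typing import Optional
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(
    search_query: Optional[str] = None,
    category: Optional[str] = None,
    max_papers: int = 100,
    start: int = 0,
) -> str:
    """Build an arXiv API URL (hypothetical helper, not the actor's code)."""
    terms = []
    if search_query:
        # Quote the keywords so a multi-word query is matched as a phrase.
        terms.append(f'all:"{search_query}"')
    if category:
        # "cat:" restricts results to one arXiv category, e.g. cs.AI.
        terms.append(f"cat:{category}")
    if not terms:
        raise ValueError("Provide a search query, a category, or both")

    params = {
        "search_query": " AND ".join(terms),
        "start": start,
        "max_results": max_papers,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes colons and quotes and turns spaces into "+".
    return f"{ARXIV_API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_arxiv_query_url("machine learning", "cs.AI", max_papers=100))
```

The response to such a URL is an Atom feed, which is why the actor lists `feedparser` among its dependencies.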
## Key Features

- ✅ Search by keywords, categories, or date ranges
- ✅ Structured paper metadata (title, abstract, authors, categories, dates, links)
- ✅ Author collaboration network analysis
- ✅ Research trend detection (top categories, authors, monthly volumes)
- ✅ Direct PDF download links (optional)
- ✅ No authentication required, uses the official arXiv API
- ✅ Output schema optimized for Apify UI (nice tables & views)

## Typical Use Cases

### For Researchers & Academics

- Literature Review - Quickly gather papers on specific topics
- Author Discovery - Find key researchers and collaboration networks
- Trend Analysis - Identify emerging research areas
- Citation Tracking - Build citation networks for meta-analysis

### For AI & Tech Companies

- Training Data - Collect academic papers for AI model training
- Research Intelligence - Track competitors and emerging technologies
- Talent Discovery - Identify leading researchers for recruitment
- Dataset Creation - Build curated research datasets

### For Developers & Analysts

- Academic Databases - Power search engines and research platforms
- Visualization Tools - Feed network graphs and trend dashboards
- API Integration - Automated research monitoring systems
- Data Analysis - Export to CSV/Excel for custom analysis

## How to Run This Actor on Apify

1. Open the actor on Apify.
2. In the Input tab, fill in the parameters (or use a template below).
3. Click Start.
4. When the run finishes, open the Dataset tab to explore the results in a friendly table view or export them.
### Minimal Input Example

```json
{
  "searchQuery": "machine learning",
  "category": "cs.AI",
  "maxPapers": 100,
  "extractCitations": true,
  "includePdfLink": true,
  "dateFrom": "2024-01-01",
  "dateTo": "2024-12-31"
}
```

### Input Parameters

| Parameter | Type | Required | Description | Example |
|-----------|------|----------|-------------|---------|
| `searchQuery` | String | No | Keywords to search for | "neural networks" |
| `category` | String | No | arXiv category filter | "cs.AI", "cs.LG", "physics.quant-ph" |
| `maxPapers` | Integer | No | Max papers to scrape (1-1000) | 100 |
| `extractCitations` | Boolean | No | Extract citation metadata | true |
| `includePdfLink` | Boolean | No | Include PDF download URLs | true |
| `dateFrom` | String | No | Filter papers after date (YYYY-MM-DD) | "2024-01-01" |
| `dateTo` | String | No | Filter papers before date (YYYY-MM-DD) | "2024-12-31" |

### Popular arXiv Categories

- `cs.AI` - Artificial Intelligence
- `cs.LG` - Machine Learning
- `cs.CV` - Computer Vision
- `cs.CL` - Computation and Language (NLP)
- `cs.RO` - Robotics
- `physics.quant-ph` - Quantum Physics
- `math.CO` - Combinatorics
- `stat.ML` - Machine Learning (Statistics)

Full list of categories

### Ready-Made Input Examples

Find recent AI papers:

```json
{
  "category": "cs.AI",
  "maxPapers": 50,
  "dateFrom": "2024-01-01"
}
```

Search for quantum computing papers:

```json
{
  "searchQuery": "quantum computing",
  "maxPapers": 30,
  "extractCitations": true
}
```

Track specific author's work:

```json
{
  "searchQuery": "Yoshua Bengio",
  "maxPapers": 20
}
```

Build machine learning dataset:

```json
{
  "category": "cs.LG",
  "maxPapers": 500,
  "dateFrom": "2023-01-01",
  "includePdfLink": true
}
```

## Output Format (What You See in the Dataset)

The actor uses a dedicated `output_schema.json` so that the Apify UI shows clean, labeled columns and views.
### Individual Paper Records

Each paper is returned as a structured JSON object:

```json
{
  "arxiv_id": "2401.12345",
  "title": "Advances in Neural Network Architectures",
  "summary": "This paper presents novel approaches to neural network design...",
  "authors": [
    "Jane Doe",
    "John Smith",
    "Alice Johnson"
  ],
  "primary_category": "cs.LG",
  "categories": ["cs.LG", "cs.AI", "stat.ML"],
  "published": "2024-01-15",
  "updated": "2024-01-20",
  "url": "https://arxiv.org/abs/2401.12345",
  "pdf_url": "https://arxiv.org/pdf/2401.12345.pdf",
  "comment": "10 pages, 5 figures, accepted to NeurIPS 2024",
  "citation_data": {
    "arxiv_id": "2401.12345",
    "references_extracted": true,
    "doi": "10.1234/example",
    "journal_reference": "NeurIPS 2024"
  }
}
```

### Author Network Analysis

```json
{
  "type": "author_network",
  "data": {
    "author_papers": {
      "Jane Doe": ["2401.12345", "2312.54321"],
      "John Smith": ["2401.12345"]
    },
    "collaborations": [
      {
        "authors": ["Jane Doe", "John Smith"],
        "count": 3
      }
    ],
    "total_authors": 156,
    "total_collaborations": 423
  },
  "generated_at": "2024-11-18T10:30:00.123456"
}
```

### Trend Analysis

```json
{
  "type": "trend_analysis",
  "data": {
    "top_categories": [
      {"category": "cs.LG", "count": 45},
      {"category": "cs.AI", "count": 38}
    ],
    "top_authors": [
      {"author": "Jane Doe", "papers": 5},
      {"author": "John Smith", "papers": 3}
    ],
    "papers_per_month": {
      "2024-01": 12,
      "2024-02": 15,
      "2024-03": 18
    },
    "total_papers": 100,
    "total_categories": 8,
    "unique_authors": 245
  },
  "generated_at": "2024-11-18T10:30:00.123456"
}
```

## Running Locally (Optional for Developers)

You don’t need this for normal paid use on Apify, but if you want to test or extend the actor locally:

```bash
pip install -r requirements.txt
python test_actor.py   # run the test suite
python test_local.py   # simple manual test
```

## Using the Data in Your Product

- Build internal research dashboards: plug the dataset into BI tools (Tableau, Power BI, Metabase).
- Feed AI & LLM pipelines: use abstracts and metadata as high-quality training or retrieval corpora.
- Power academic search or recommendation features: index papers by topic, author, and time.
- Track research signals: monitor new papers in specific categories over time.

## Project Structure (For Technical Users)

```
arxiv-citation-network-scraper/
├── .actor/
│   ├── actor.json           # Actor metadata and configuration
│   └── input_schema.json    # Input form schema for Apify UI
├── src/
│   └── main.py              # Main actor code with scraping logic
├── Dockerfile               # Container configuration
├── requirements.txt         # Python dependencies
├── .gitignore               # Git ignore rules
└── README.md                # This file
```

## How It Works (Under the Hood)

### Technical Flow

1. API Query Construction
   - Builds arXiv API query from user parameters
   - Supports keyword search, category filters, date ranges
   - Uses proper URL encoding and parameter formatting
2. Data Fetching
   - Fetches data from arXiv API (Atom feed format)
   - Parses XML/Atom using feedparser library
   - Handles pagination and rate limiting
3. Metadata Extraction
   - Extracts paper title, abstract, authors
   - Captures categories, dates, arXiv ID
   - Generates PDF and abstract page URLs
4. Citation Analysis
   - Scrapes individual paper pages for citation metadata
   - Extracts DOI and journal references when available
   - Builds citation network data structure
5. Network Building
   - Analyzes author collaboration patterns
   - Identifies co-authorship relationships
   - Counts collaboration frequency
6. Trend Analysis
   - Aggregates papers by category
   - Tracks publication trends over time
   - Identifies most prolific authors
7. Data Output
   - Pushes individual paper records to Apify dataset
   - Adds network analysis summary
   - Includes trend analysis report

## Technical Details

Dependencies:

- `apify` (>=2.1.0) - Apify SDK for Python
- `beautifulsoup4` (4.12.3) - HTML parsing for citation extraction
- `requests` (2.31.0) - HTTP requests
- `lxml` (>=5.3.0) - Fast XML/HTML parser
- `feedparser` (>=6.0.11) - Atom/RSS feed parsing

API Information:

- Source: arXiv.org official API
- Documentation: https://arxiv.org/help/api
- Rate Limits: 3 seconds between requests (handled automatically)
- Max Results: 30,000 per query (practical limit ~1000 for performance)

Performance:

- ~50 papers: 10-20 seconds
- ~100 papers: 20-40 seconds
- ~500 papers: 2-3 minutes
- Citation extraction adds ~0.5s per paper

Error Handling:

- Network errors are caught and logged
- Failed paper scrapes don't stop the actor
- Graceful degradation for missing metadata
- Detailed logging for debugging

## Limitations & Notes

- arXiv abstract pages don't include full reference lists (would require PDF parsing)
- Citation extraction is limited to metadata available on abstract pages
- Date filtering is applied post-fetch (API limitations)
- Large result sets (>1000 papers) may take several minutes
- Some papers may have incomplete metadata
- PDF links are direct URLs, not downloaded files

## API Integration (For Automation)

Once the actor is in your Apify account, you can start runs and read datasets via the Apify API. This makes it easy to plug the actor into your pipelines, cron jobs, and backend services.
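
A minimal sketch of that integration, using only the Python standard library and the public Apify API v2 endpoints for starting an actor run and listing dataset items. The actor ID `codepoetry~arxiv-citation-network-scraper` is an assumption derived from the developer and actor names on this page, and all helper names are hypothetical; confirm the exact ID in the Apify Console and supply your own API token.

```python
import json
import urllib.request
from urllib.parse import urlencode

APIFY_API_BASE = "https://api.apify.com/v2"
# Assumed actor ID in "username~actor-name" form; verify it before use.
ACTOR_ID = "codepoetry~arxiv-citation-network-scraper"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that starts an actor run."""
    return urllib.request.Request(
        f"{APIFY_API_BASE}/acts/{ACTOR_ID}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def build_dataset_items_url(dataset_id: str, limit: int = 100) -> str:
    """URL for reading items from a run's dataset as JSON."""
    query = urlencode({"format": "json", "limit": limit})
    return f"{APIFY_API_BASE}/datasets/{dataset_id}/items?{query}"


def split_items(items: list) -> dict:
    """Separate paper records from the summary items by their "type" field.

    Per the output format above, the network and trend summaries carry a
    "type" key ("author_network", "trend_analysis"); paper records do not.
    """
    result = {"papers": [], "author_network": None, "trend_analysis": None}
    for item in items:
        item_type = item.get("type")
        if item_type in ("author_network", "trend_analysis"):
            result[item_type] = item["data"]
        else:
            result["papers"].append(item)
    return result


if __name__ == "__main__":
    request = build_run_request("<YOUR_APIFY_TOKEN>", {"category": "cs.AI", "maxPapers": 50})
    print(request.get_method(), request.full_url)
    print(build_dataset_items_url("<DATASET_ID>"))
```

Sending the prepared request with `urllib.request.urlopen` returns a JSON body whose `data.defaultDatasetId` field identifies the dataset to read once the run has finished. Nothing in the sketch touches the network until you do that, so the request can be inspected first.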
## Troubleshooting

No papers found:

- Verify your search query and category are correct
- Try broadening your search (remove filters)
- Check if arXiv API is accessible: https://arxiv.org/help/api

Slow performance:

- Reduce `maxPapers` parameter
- Disable `extractCitations` for faster runs
- Consider running during off-peak hours

Missing metadata:

- Some papers have incomplete information on arXiv
- This is normal and handled gracefully
- Check the logs for specific issues

Local execution issues:

- Ensure all dependencies are installed: `pip install -r requirements.txt`
- Check Python version: `python --version` (needs 3.8+)
- Verify internet connectivity to arxiv.org

## Resources

- arXiv.org - The source platform
- arXiv API Documentation - Official API docs
- Apify Platform - Run and scale actors
- Apify Python SDK - SDK documentation
- arXiv Category Taxonomy - All categories

## FAQ

Q: Is this allowed under arXiv’s terms?
Yes. The actor uses the official arXiv API and respects its documented usage guidelines.

Q: How many papers can I fetch?
Practically up to around 1,000 per run for good performance. You can run the actor multiple times with different filters.

Q: Can I run this on a schedule?
Yes. On Apify you can set a schedule (e.g., daily) to keep your dataset up to date.

Q: Does it download PDFs?
It returns direct `pdf_url` links. You can then download PDFs in your own pipeline if needed.

Q: Who is this for?
Research teams, data scientists, AI/LLM engineers, and product teams who need reliable, structured arXiv data without maintaining their own scraper.

---

Ready to turn arXiv into actionable research data? Run the actor on Apify and start exploring.

Categories

DEVELOPER_TOOLS SEO_TOOLS

Actor Information

Developer: codepoetry
Pricing: Paid
Total Runs: 13
Active Users: 2
