Shopify Scraper (GraphQL)

by runexes

Scrape Shopify stores efficiently using their sitemap and official GraphQL API. Get clean product data fast with batching, incremental processing, and lower costs.

38 runs
12 users

About Shopify Scraper (GraphQL)

If you've ever tried scraping a Shopify store, you know it can be a real headache. You either get blocked, miss half the data, or the script crawls at a snail's pace. That's why I built this scraper. It works the way Shopify expects: first, it crawls the store's `sitemap.xml` to find every product page. Then, instead of scraping the messy HTML, it makes direct requests to Shopify's own Storefront GraphQL API to pull structured product data like titles, variants, prices, and images. This method is more reliable and also respects the store's infrastructure.

The real gains come from the optimizations. Per-host batching groups several products into each request, which cuts the number of API calls and keeps costs low. Processing is incremental, so if your run gets interrupted, you can pick up right where you left off. All the data is written to a buffered dataset, so records are saved in batches as they come in instead of piling up in memory on large jobs.

I use this tool myself for competitive analysis, price monitoring, and building product catalogs. It's the fastest, most cost-effective way I've found to get clean, complete data from any Shopify store.

What does this actor do?

Shopify Scraper (GraphQL) is an actor on the Apify platform that extracts product data from Shopify stores. It runs in the cloud, so the scraping itself needs no local setup.

Key Features

  • Cloud-based execution - no local setup required
  • Scalable infrastructure for large-scale operations
  • API access for integration with your applications
  • Built-in proxy rotation and anti-blocking measures
  • Scheduled runs and webhooks for automation

How to Use

  1. Click "Try This Actor" to open it on Apify
  2. Create a free Apify account if you don't have one
  3. Configure the input parameters as needed
  4. Run the actor and download your results

Documentation

Shopify Scraper (GraphQL)

An Apify actor that extracts product data from Shopify stores. It crawls a store's sitemap.xml to find product pages and then uses the Shopify Storefront GraphQL API to fetch detailed product information efficiently.
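
As a rough illustration of the sitemap step, the sketch below pulls /products/<handle> entries out of a sitemap document with regular expressions and applies an optional cutoff date. The function name `extractProductEntries` and the returned shape are assumptions made for this example, not the actor's actual code. It handles a single `<urlset>` document only; a sitemap index that points to child sitemaps would need one extra fetch per child.

```javascript
// Sketch only: extract product entries from the text of one sitemap document.
// Assumes the usual <url><loc>...</loc><lastmod>...</lastmod></url> layout.
function extractProductEntries(sitemapXml, updatedSince) {
  const cutoff = updatedSince ? Date.parse(updatedSince) : null;
  const entries = [];
  const urlBlocks = sitemapXml.match(/<url>[\s\S]*?<\/url>/g) || [];

  for (const block of urlBlocks) {
    const loc = (block.match(/<loc>\s*([^<\s]+)\s*<\/loc>/) || [])[1];
    if (!loc) continue;

    let pathname;
    try {
      pathname = new URL(loc).pathname;
    } catch {
      continue; // skip anything that is not an absolute URL
    }

    // Keep only product pages, e.g. /products/red-shirt
    const handleMatch = pathname.match(/\/products\/([^/]+)\/?$/);
    if (!handleMatch) continue;

    const lastmod = (block.match(/<lastmod>\s*([^<\s]+)\s*<\/lastmod>/) || [])[1] || null;

    // Entries without <lastmod> are kept, since their age is unknown.
    if (cutoff !== null && lastmod && Date.parse(lastmod) < cutoff) continue;

    entries.push({ url: loc, handle: decodeURIComponent(handleMatch[1]), lastmod });
  }
  return entries;
}
```

Calling `extractProductEntries(xmlText, '2024-01-01')` would return only product entries whose `<lastmod>` is on or after that date, plus any entries that carry no `<lastmod>` at all.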

Overview

This actor is designed for speed and cost-effectiveness. It processes stores by batching GraphQL requests, supports incremental runs to avoid re-scraping, and can filter products based on their last modified date. Output is structured as one record per product, with all variants nested inside.
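
One way to picture batching combined with incremental runs: drop the handles that an earlier run already covered, then split the rest into fixed-size groups. The helper `planBatches` and the idea of passing the processed handles as a `Set` are assumptions for this sketch; how the actor actually persists its state between runs is not described here.

```javascript
// Sketch only: split handles into batches, skipping ones seen in a previous run.
function planBatches(handles, batchSize = 10, alreadyProcessed = new Set()) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  // De-duplicate first, then remove handles that were already processed.
  const pending = [...new Set(handles)].filter((h) => !alreadyProcessed.has(h));

  const batches = [];
  for (let i = 0; i < pending.length; i += batchSize) {
    batches.push(pending.slice(i, i + batchSize));
  }
  return batches;
}
```

With five handles, a batch size of 2, and one handle already processed, this yields two batches of two handles each.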

Key Features

  • Sitemap-based Crawling: Discovers product URLs from a store's sitemap.xml (specifically /products/<handle> paths).
  • Efficient GraphQL Queries: Batches multiple product requests into single GraphQL calls using aliases, reducing network overhead.
  • Incremental Processing: Can skip products that have already been processed in previous runs.
  • Date Filtering: Optionally ignores products not updated since a given date (updatedSince).
  • Performance Tuning: Configurable batch size, concurrency, and buffered dataset writes.
  • Extensible: Use extendScraperFunction for custom logic during the scraping lifecycle and extendOutputFunction to transform final records.
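
To make the alias-based batching concrete, here is a minimal sketch of how several product lookups can share one GraphQL request: each product gets its own alias (`p0`, `p1`, ...) and its handle is passed as a variable. The function name `buildBatchQuery` and the selected fields are illustrative assumptions; the actor's real query may request different fields.

```javascript
// Illustrative field selection; the actor's actual selection is not shown in this document.
const PRODUCT_FIELDS = `
    id
    handle
    title
    updatedAt
    variants(first: 50) {
      nodes { id title sku price { amount currencyCode } }
    }
  `;

// Sketch only: build one aliased query plus its variables for a list of handles.
function buildBatchQuery(handles) {
  if (handles.length === 0) throw new Error('at least one handle is required');

  const variableDefs = handles.map((_, i) => `$h${i}: String!`).join(', ');
  const selections = handles
    .map((_, i) => `  p${i}: product(handle: $h${i}) {${PRODUCT_FIELDS}}`)
    .join('\n');
  const variables = Object.fromEntries(handles.map((h, i) => [`h${i}`, h]));

  return { query: `query Batch(${variableDefs}) {\n${selections}\n}`, variables };
}
```

Passing handles as variables rather than interpolating them into the query text avoids quoting problems with unusual handles. In the response, each product would come back under its alias, so `data.p0` corresponds to the first handle in the batch.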

Input / Configuration

Core parameters required to run the actor:

  • startUrls: Array containing your target sitemap.xml URL(s).
  • storefrontApiVersion: The Shopify Storefront API version to use (e.g., 2024-07).
  • storefrontAccessToken: Your store's Storefront API access token.
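
These three parameters are enough to form a Storefront API request. The sketch below assumes the store's Storefront endpoint lives on the same host as the sitemap, at `/api/<version>/graphql.json`, and that the token is sent in the `X-Shopify-Storefront-Access-Token` header. The helper name `buildStorefrontRequest` is made up for this example.

```javascript
// Sketch only: derive the GraphQL endpoint and fetch options from the actor input.
function buildStorefrontRequest(sitemapUrl, apiVersion, accessToken, body) {
  // Assumption: the host that serves sitemap.xml also serves the Storefront API.
  const { host } = new URL(sitemapUrl);
  return {
    url: `https://${host}/api/${apiVersion}/graphql.json`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Shopify-Storefront-Access-Token': accessToken,
      },
      body: JSON.stringify(body), // { query, variables }
    },
  };
}

// Usage (not executed here):
//   const { url, options } = buildStorefrontRequest(sitemap, '2024-07', token, { query, variables });
//   const response = await fetch(url, options);
```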

Essential performance and utility settings:

  • maxRequestsPerCrawl, maxConcurrency, maxRequestRetries, proxyConfiguration: Standard Apify crawl controls.
  • updatedSince: ISO date string; skips products with a <lastmod> older than this.
  • batchSize: Number of product handles to query per GraphQL request (default: 10).
  • perHostConcurrency: Parallel GraphQL requests allowed per store host (default: 2).
  • bufferWrites & bufferSize: Controls buffering for dataset writes to improve performance.
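
The buffering controlled by bufferWrites and bufferSize can be pictured roughly as follows: records accumulate in memory and are pushed in one call once the buffer is full. The class name `BufferedWriter`, the default size of 100, and the injected `pushBatch` callback are assumptions for this sketch; in an Apify actor the callback could simply forward the array to `Actor.pushData`.

```javascript
// Sketch only: collect items and hand them to pushBatch in groups of bufferSize.
class BufferedWriter {
  constructor(pushBatch, bufferSize = 100) {
    this.pushBatch = pushBatch; // function (items[]) that stores a batch, may be async
    this.bufferSize = bufferSize;
    this.buffer = [];
  }

  async write(item) {
    this.buffer.push(item);
    if (this.buffer.length >= this.bufferSize) await this.flush();
  }

  async flush() {
    if (this.buffer.length === 0) return;
    // Swap the buffer before pushing so concurrent writes land in a fresh array.
    const items = this.buffer;
    this.buffer = [];
    await this.pushBatch(items);
  }
}
```

A final `flush()` at the end of the run is needed so the last partial batch is not lost.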

How to Use

Running Locally

  1. Install dependencies:
    npm install
  2. Create an input file at apify_storage/key_value_stores/default/INPUT.json:
    {
      "startUrls": [{ "url": "https://example.com/sitemap.xml" }],
      "storefrontApiVersion": "2024-07",
      "storefrontAccessToken": "<YOUR_TOKEN>",
      "maxRequestsPerCrawl": 50
    }
  3. Start the actor:
    npm start
    For development with auto-restart:
    npm run dev

Docker Quick Start

Using the provided Makefile:

make init  # Creates .env and INPUT.json from templates
make run   # Builds and starts the Docker container

Output datasets will be in apify_storage/datasets/default.

Output

The actor saves one item per product to the dataset. The product's variants are available within the additional.variants property of each record. The structure is based on the response from the Shopify Storefront GraphQL API.
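
If a downstream tool needs one row per variant instead of one row per product, the nested records can be flattened after download. The sketch below assumes only that variants sit under `additional.variants`; since the exact shape is not spelled out here, it accepts either a plain array or a GraphQL connection with `nodes` or `edges`. The function names are hypothetical.

```javascript
// Sketch only: read the variant list whether it is an array or a GraphQL connection.
function listVariants(record) {
  const variants = record.additional?.variants;
  if (Array.isArray(variants)) return variants;
  if (Array.isArray(variants?.nodes)) return variants.nodes;
  if (Array.isArray(variants?.edges)) return variants.edges.map((edge) => edge.node);
  return [];
}

// Produce one row per variant, copying the product-level fields onto each row.
function toVariantRows(record) {
  const { additional, ...productFields } = record;
  return listVariants(record).map((variant) => ({ ...productFields, variant }));
}
```

A product with three variants becomes three rows, each carrying the product-level fields plus a single `variant` object.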

Extensibility

You can inject custom logic at specific points:

  • extendScraperFunction: Provides lifecycle hooks (SETUP, FILTER_SITEMAP_URL, PRENAVIGATION, POSTNAVIGATION, RUN, FINISHED) for custom actions.
  • extendOutputFunction: Allows you to modify or filter the final product record before it is saved to the dataset.
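
The exact signature the actor expects for extendOutputFunction is not documented here, so the following is only a guess at the kind of logic such a function might hold: a plain transform that receives one product record and returns either the record to save or `null` to drop it. The name `summarizeRecord` and the added `variantCount` field are hypothetical.

```javascript
// Hypothetical transform; how it is wired into extendOutputFunction depends on
// the actor's real signature, which may differ from this sketch.
function summarizeRecord(record) {
  const variants = record.additional?.variants;
  const count = Array.isArray(variants) ? variants.length : 0;

  // Drop products that have no variants at all.
  if (count === 0) return null;

  // Return a copy so the original record is left untouched.
  return { ...record, variantCount: count };
}
```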

Project Info

  • License: Apache License 2.0 (see LICENSE and NOTICE files).
  • CI/CD: GitHub Actions workflows (ci.yml, codeql.yml) handle testing, linting, and security analysis on pushes and pull requests.

Common Use Cases

Market Research

Gather competitive intelligence and market data

Lead Generation

Extract contact information for sales outreach

Price Monitoring

Track competitor pricing and product changes

Content Aggregation

Collect and organize content from multiple sources

Actor Information

Developer: runexes
Pricing: Paid
Total Runs: 38
Active Users: 12