Metadata Scraper
by louisdeconinck
About Metadata Scraper
Automatically scrape metadata such as the title, description, heading, and article from websites. The actor crawls the start URLs and then scrapes the metadata from the detail pages, automatically navigating through pagination.
What does this actor do?
Metadata Scraper is a web scraping and automation tool available on the Apify platform. It crawls your start URLs, follows pagination and detail-page links that match your glob patterns, and extracts the title, description, main heading, and article text from each page it visits, all running in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results (or run it via the API, as sketched below)
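If you prefer to run the actor programmatically, here is a minimal sketch using the Apify Python client. The actor ID, token placeholder, and input values are illustrative assumptions based on the example input in the documentation; verify the actual actor ID on its Apify page before running.

```python
# pip install apify-client
from apify_client import ApifyClient

# Assumed values: replace the token with your own and double-check the actor ID.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "startUrls": [
        {
            "url": "https://roger-hannah.co.uk/property-search/?search_properties=1",
            "scrapeUrlGlobs": ["https://roger-hannah.co.uk/properties/*"],
            "paginationUrlGlobs": [],
        }
    ],
    "maxRequestsPerCrawl": 100,
}

# Start the run and wait for it to finish.
run = client.actor("louisdeconinck/metadata-scraper").call(run_input=run_input)

# Iterate over the scraped items stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), "-", item.get("title"))
```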
Documentation
Metadata Scraper automatically scrapes metadata such as the title, description, heading, and article from websites. It crawls the start URLs and then scrapes the metadata from the detail pages, automatically navigating through pagination.

## Features

- Scrapes metadata from specified websites
- Handles pagination and detail pages
- Extracts title, description, heading, and article content
- Configurable start URLs and maximum requests per crawl
- Ignores specified URLs so you don't get duplicates when scraping multiple times

## Input

Be sure to use JSON mode for the input and not Manual mode. Here's an overview of the input parameters:

- startUrls: An array of objects containing:
  - url: The starting URL for the scrape
  - scrapeUrlGlobs: An array of URL patterns for detail pages to scrape
  - paginationUrlGlobs: An array of URL patterns for pagination pages (optional)
- maxRequestsPerCrawl: Maximum number of requests per crawl (default: 100)
- urlsToIgnore: An array of URLs to ignore when processing (optional)

Here's an example of the input data structure:

```json
{
  "startUrls": [
    {
      "url": "https://roger-hannah.co.uk/property-search/?search_properties=1&tenure=&property_type%5B%5D=Development&property_type%5B%5D=Industrial&size_min=0&size_max=1000000",
      "scrapeUrlGlobs": ["https://roger-hannah.co.uk/properties/*"],
      "paginationUrlGlobs": []
    }
  ],
  "maxRequestsPerCrawl": 100,
  "urlsToIgnore": [
    "https://roger-hannah.co.uk/properties/development-site-with-potential-for-10-houses-planning-permission/",
    "https://roger-hannah.co.uk/properties/lower-mill-mill-street/"
  ]
}
```

### Using Glob Patterns

Glob patterns are used to match URLs. They are similar to regular expressions but simpler to write, and the actor uses them to match detail pages and pagination pages. Here are some common glob patterns used in URL matching:

1. `*`: Matches any number of characters (except `/`)
   Example: `https://example.com/*.html` matches all HTML files in the root directory
2. `**`: Matches any number of characters (including `/`)
   Example: `https://example.com/**/*.jpg` matches all JPG files in any subdirectory
3. `?`: Matches exactly one character
   Example: `https://example.com/page?.html` matches page1.html, pageA.html, etc.
4. `[...]`: Matches any one character in the brackets
   Example: `https://example.com/file[123].txt` matches file1.txt, file2.txt, file3.txt
5. `[!...]`: Matches any one character not in the brackets
   Example: `https://example.com/img[!0-9].png` matches imgA.png but not img1.png
6. `{...}`: Matches any of the comma-separated patterns
   Example: `https://example.com/{blog,news}/*.html` matches both blog and news HTML files

Examples in the context of web scraping:

- `https://example.com/products/*.html`: Matches all product detail pages
- `https://example.com/category/*/page-*.html`: Matches pagination pages in all categories
- `https://example.com/{2021,2022,2023}/**`: Matches all pages from specific years
- `https://example.com/page/*`: Matches all pages directly under /page/
- `https://example.com/page/**`: Matches all pages under /page/ at any depth

When writing glob patterns for scrapeUrlGlobs and paginationUrlGlobs, make sure they accurately reflect the structure of the website you're scraping so that all relevant pages are captured. A rough local checker is sketched below.
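The actual matching happens on the Apify platform, but it can help to sanity-check your patterns against a few sample URLs before starting a crawl. The sketch below is a rough approximation of the glob semantics described above (covering `*`, `**`, `?`, `[...]`, `[!...]`, and `{...}`) in plain Python; it is an illustration, not the actor's real matcher.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL glob (as described above) into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "*":
            if pattern[i:i + 2] == "**":
                out.append(".*")       # ** matches any characters, including '/'
                i += 2
            else:
                out.append("[^/]*")    # * matches any characters except '/'
                i += 1
        elif char == "?":
            out.append("[^/]")         # ? matches exactly one character
            i += 1
        elif char == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            if body.startswith("!"):
                body = "^" + body[1:]  # [!...] negates the character set
            out.append("[" + body + "]")
            i = end + 1
        elif char == "{":
            end = pattern.index("}", i)
            options = pattern[i + 1:end].split(",")
            out.append("(" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(char))
            i += 1
    return re.compile("^" + "".join(out) + "$")

# Sanity checks against the examples above.
print(bool(glob_to_regex("https://example.com/{blog,news}/*.html")
           .match("https://example.com/blog/latest-post.html")))  # True
print(bool(glob_to_regex("https://example.com/img[!0-9].png")
           .match("https://example.com/img1.png")))               # False
```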
## Output

The actor outputs the following data for each scraped detail page:

- url: The URL of the scraped page
- title: The title of the detail page
- description: The description of the detail page
- heading: The main heading of the detail page
- article: The content of the detail page

Here's an example of the output data structure:

```json
{
  "url": "https://roger-hannah.co.uk/properties/bolton-street/",
  "title": "Bolton Street - Roger Hannah",
  "description": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl...",
  "heading": "Bolton Street",
  "article": "Property Information The property comprises of a detached former warehouse/showroom facility constructed by way of a steel portal frame with concrete render under a pitched tiled roof. Access to the property is via personnel entrance doors fronting Bolton Street with rear loading access off Millett Street via two electrically operated roller shutter loading doors. There is a small private yard/parking/loading area to the rear of the premises. Internally, the facility provided flexible ground fl..."
}
```
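You can export finished runs straight from the Apify console (CSV, JSON, Excel), but if you post-process results in your own code, a small sketch like the one below flattens items with the fields listed above into a CSV file. The input filename is a placeholder assumption; substitute however you actually download the dataset.

```python
import csv
import json

# Placeholder path: assumes the dataset was already downloaded as JSON,
# e.g. via the console export button or the Apify API.
with open("dataset_metadata-scraper.json", encoding="utf-8") as f:
    items = json.load(f)

fields = ["url", "title", "description", "heading", "article"]

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        # Keep only the documented output fields; missing keys become empty cells.
        writer.writerow({field: item.get(field, "") for field in fields})

print(f"Wrote {len(items)} rows to metadata.csv")
```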
Common Use Cases
- Market Research: Gather competitive intelligence and market data
- Lead Generation: Extract contact information for sales outreach
- Price Monitoring: Track competitor pricing and product changes
- Content Aggregation: Collect and organize content from multiple sources
Ready to Get Started?
Try Metadata Scraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: louisdeconinck
- Pricing: Paid
- Total Runs: 44,269
- Active Users: 123
Related Actors
- Video Transcript Scraper: Youtube, X, Facebook, Tiktok, etc. (by invideoiq)
- Linkedin Profile Details Scraper + EMAIL (No Cookies Required) (by apimaestro)
- Twitter (X.com) Scraper Unlimited: No Limits (by apidojo)
- Content Checker (by jakubbalada)