Extended GPT Scraper
by drobnikj
About Extended GPT Scraper
Need to get structured data from websites and actually do something useful with it? That's where the Extended GPT Scraper comes in. It's a straightforward actor that handles both parts: first, it scrapes the content you need from any public webpage. Then, it automatically pipes that raw text into OpenAI's GPT models via the API. Think of it as giving ChatGPT the ability to read and process the entire internet for you. You can use it to clean up and proofread scraped blog posts, gauge customer sentiment from product reviews, or condense long articles into concise summaries. One of the most practical uses is for lead generation—set it to crawl business directories or company sites to find and extract email addresses and contact details, saving you hours of manual work. I've used it to process batches of support forum comments for common issues and to pull key takeaways from competitor news pages. Because it's open-source, you can tweak the prompts and logic to fit your exact project, whether that's for AI training data preparation, content analysis, or building custom datasets. It turns messy, unstructured web data into actionable information you can actually use.
What does this actor do?
Extended GPT Scraper is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results (a programmatic sketch using the Apify API follows below)
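If you prefer to run the actor from code rather than clicking through the Console, the sketch below uses the official apify-client Python package (install it with `pip install apify-client`). The actor ID "drobnikj/extended-gpt-scraper" and the input keys "instructions" and "openaiApiKey" are assumptions based on the documentation further down this page; verify them against the actor's input schema before relying on this.

```python
# Minimal sketch: run Extended GPT Scraper via the Apify API and read the results.
# Assumptions (not confirmed on this page): the actor ID "drobnikj/extended-gpt-scraper"
# and the "instructions"/"openaiApiKey" input keys; check the actor's input schema.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # personal API token from Apify Console

run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],         # pages to visit
    "instructions": "Summarize this page in three sentences.",  # prompt for GPT
    "openaiApiKey": "<YOUR_OPENAI_API_KEY>",
    "model": "gpt-4o-mini",                                      # any model you have access to
}

# Start the actor run and wait for it to finish.
run = client.actor("drobnikj/extended-gpt-scraper").call(run_input=run_input)

# Download the scraped and GPT-processed items from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```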
Documentation
Extended GPT Scraper is a powerful tool that leverages OpenAI's API to modify text obtained from a scraper. You can use the scraper to extract content from a website and then pass that content to the OpenAI API to make the GPT magic happen.

## How does Extended GPT Scraper work?

The scraper first loads the page using Playwright, converts the content into markdown format, and then sends that markdown to GPT together with your instructions. If the content doesn't fit into the GPT limit, the scraper truncates it; you can find a message about the truncated content in the log.

## How much does it cost?

There are two costs associated with using Extended GPT Scraper.

### Cost of the OpenAI API

You can find the cost of the OpenAI API on the OpenAI pricing page. The cost depends on the model you are using and the length of the content you send to the API.

### Cost of the scraping itself

The cost of the scraper is the same as the cost of Web Scraper, because it uses the same browser under the hood. You can find information about the cost on the pricing page under the Detailed pricing breakdown section. The cost estimates are based on averages and may vary depending on the complexity of the pages you scrape.

> If you are looking for a basic and more predictable GPT Scraper whose price includes the OpenAI API cost, check out the GPT Scraper. It is also able to extract content from a website and then pass that content to the OpenAI API.

## How to use Extended GPT Scraper

To get started with Extended GPT Scraper, set up the pages you want to scrape using Start URLs, write instructions telling the scraper how to handle each page, and provide your OpenAI API key.

NOTE: You can find the OpenAI API key in your OpenAI dashboard.

You can configure the scraper and GPT using the Input configuration options below to set up a more complex workflow.

## Input configuration

Extended GPT Scraper accepts a number of configuration settings. These can be entered either manually in the user interface in Apify Console or programmatically in a JSON object using the Apify API. For a complete list of input fields and their types, please see the Actor's input schema.

### Start URLs

The Start URLs (startUrls) field represents the initial list of page URLs that the scraper will visit. You can enter URLs one by one or in bulk using file upload. The scraper supports adding new URLs to scrape on the fly, using the Link selector and Glob patterns options.

### Link selector

The Link selector (linkSelector) field contains a CSS selector that is used to find links to other web pages (items with href attributes, e.g. <div class="my-class" href="...">). On every page that is loaded, the scraper looks for all links matching the Link selector and checks that the target URL matches one of the Glob patterns. If it is a match, it adds the URL to the request queue so that the scraper loads it later on. If the Link selector is empty, page links are ignored, and the scraper only loads pages specified in Start URLs.

### Glob patterns

The Glob patterns (globs) field specifies which types of URLs found by the Link selector should be added to the request queue. A glob pattern is simply a string with wildcard characters. For example, the glob pattern http://www.example.com/pages/**/* will match all of the following URLs:

- http://www.example.com/pages/deeper-level/page
- http://www.example.com/pages/my-awesome-page
- http://www.example.com/pages/something
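To make the interplay of these three fields concrete, here is a small, hypothetical input fragment. The field names (startUrls, linkSelector, globs) come from the documentation above; the value shapes (URLs and globs wrapped in objects) are an assumption borrowed from other Apify scrapers and should be checked against this actor's input schema.

```python
# Hypothetical crawl-scope fragment of the actor input. Field names are taken from
# the docs above; the object-wrapped value shapes are an assumption, so verify them
# in the actor's input schema.
crawl_scope = {
    "startUrls": [{"url": "http://www.example.com/pages/"}],    # where crawling begins
    "linkSelector": "a[href]",                                   # which links to consider
    "globs": [{"glob": "http://www.example.com/pages/**/*"}],    # which of those links to enqueue
}
```

With an empty linkSelector the globs never come into play, because only the Start URLs are loaded.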
### OpenAI API key

The API key for accessing OpenAI. You can get it from the OpenAI platform.

### Instructions and prompts for GPT

This option tells GPT how to handle page content. For example, you can send the following prompts:

- "Summarize this page in three sentences."
- "Find sentences that contain 'Apify Proxy' and return them as a list."

You can also instruct OpenAI to answer with "skip this page" if you don't want to process all the scraped content, e.g.:

- "Summarize this page in three sentences. If the page is about proxies, answer with 'skip this page'."

### GPT Model

The GPT Model (model) option specifies which GPT model to use. You can find more information about the models in the OpenAI API documentation. Keep in mind that each model has different pricing and features.

### Max crawling depth

This specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depth for misconfigured scrapers.

### Max pages per run

The maximum number of pages that the scraper will open. 0 means unlimited.

### Formatted output

If you want to get data in a structured format, you can define a JSON schema using the Schema input option and enable the Use JSON schema to format answer option. This schema will be used to format data into a structured JSON object, which is stored in the output in the jsonAnswer attribute.

### Proxy configuration

The Proxy configuration (proxyConfiguration) option enables you to set proxies, which the scraper uses to prevent detection by target websites. You can use both Apify Proxy and custom HTTP or SOCKS5 proxy servers.
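As a final sketch, the input below combines prompts, a model choice, structured output, and proxy settings. The keys "schema" and "useStructureOutput" are guesses at the fields behind the Schema and Use JSON schema to format answer options, and "openaiApiKey" and "instructions" are likewise assumed names, so treat this as an illustration rather than a copy-paste configuration; only the jsonAnswer output attribute is documented above.

```python
# Hedged sketch of an input asking for structured (JSON-schema) output.
# "schema", "useStructureOutput", "openaiApiKey" and "instructions" are assumed
# field names; confirm them against the actor's input schema in Apify Console.
structured_run_input = {
    "startUrls": [{"url": "https://example.com/contact"}],
    "openaiApiKey": "<YOUR_OPENAI_API_KEY>",
    "instructions": "Extract the company name and contact email from this page.",
    "model": "gpt-4o",
    "useStructureOutput": True,                 # assumed key for "Use JSON schema to format answer"
    "schema": {                                 # assumed key for the "Schema" option
        "type": "object",
        "properties": {
            "companyName": {"type": "string"},
            "email": {"type": "string"},
        },
        "required": ["companyName", "email"],
    },
    "proxyConfiguration": {"useApifyProxy": True},   # documented proxyConfiguration field
}
# Each dataset item from such a run should then carry the structured result in its
# "jsonAnswer" attribute, as described in the Formatted output section above.
```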
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try Extended GPT Scraper now on Apify. Free tier available with no credit card required.
Actor Information
- Developer
- drobnikj
- Pricing
- Paid
- Total Runs
- 491,339
- Active Users
- 1,569
Related Actors
Google Search Results Scraper
by apify
Website Content Crawler
by apify
🔥 Leads Generator - $3/1k 50k leads like Apollo
by microworlds
Video Transcript Scraper: Youtube, X, Facebook, Tiktok, etc.
by invideoiq
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.
Learn more about Apify