Smart Article Extractor
by lukaskrivka
About Smart Article Extractor
Tired of manually copying articles from news sites or academic journals? I was too. That's why I built the Smart Article Extractor. It's my go-to for pulling clean, structured article data from practically any website—whether it's a breaking news blog, a scientific publication, or an industry journal. Here’s how it works: you give it a website URL, and it intelligently crawls the site, automatically figuring out which pages are actual articles versus navigation or contact pages. It saves you hours of tedious work. You get just the content you need: the headline, author, publish date, and full text, perfectly cleaned up. I use it all the time for building news aggregators, tracking academic research, or compiling datasets for AI training. Once it's done, you can grab your data in the format that fits your workflow. Download it as JSON for your app, an Excel spreadsheet for analysis, an HTML table for a report, or even as an RSS feed to stay updated. It just works.
What does this actor do?
Smart Article Extractor is a web scraping and automation tool available on the Apify platform. It's designed to help you extract data and automate tasks efficiently in the cloud.
Key Features
- Cloud-based execution - no local setup required
- Scalable infrastructure for large-scale operations
- API access for integration with your applications
- Built-in proxy rotation and anti-blocking measures
- Scheduled runs and webhooks for automation
How to Use
- Click "Try This Actor" to open it on Apify
- Create a free Apify account if you don't have one
- Configure the input parameters as needed
- Run the actor and download your results
Documentation
Smart Article Extractor scrapes articles from any academic, scientific, or news website or blog with just a single click. It uses a smart algorithm to decide what pages are actually articles and automatically extracts information from them.

## What does Smart Article Extractor do?

If you want to download articles from websites, this tool will help you extract content using smart scraping features:

✅ Allows opening pages with a browser (Puppeteer) which can wait for dynamically loaded data

✅ Allows extraction of articles from any number of URLs

✅ Smart article recognition - the extractor can decide what pages on a website are in fact articles to be scraped (this function is customizable)

✅ Additional filters - date of articles, minimum words, and more

✅ Allows custom scraping function - you can add/overwrite your own fields from the parsed HTML

✅ Allows usage of Google Bot headers (bypassing paywalls)

## Why extract articles with Smart Article Extractor?

👉 Academic research: You can use Smart Article Extractor to download multiple articles and build a corpus from them for research and article citations.

👉 Journalism: If you want to know more about how extracting articles with this tool can help text analysis and data journalism, you might like to read Terror or Clickbait? or Czech media and their word choices.

👉 Fight fake news: Monitor content by selected media to react promptly if they publish misinformation.

👉 Save time: Whatever your reason for collecting articles with Smart Article Extractor, you will definitely save a lot of time and energy.

### Is it legal to extract articles?

Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website.

## How many results can you scrape with Smart Article Extractor?
Smart Article Extractor can return thousands of results on average. However, you have to keep in mind that scraping news websites has many variables to it and may cause the results to fluctuate case by case. There’s no one-size-fits-all-use-cases number. The maximum number of results may vary depending on the complexity of the input, location, and other factors. Some of the most frequent cases are:

- website gives a different number of results depending on the type/value of the input
- website has an internal limit that no scraper can cross
- scraper has a limit that we are working on improving

Therefore, while we regularly run Actor tests to keep the benchmarks in check, the results may also fluctuate without our knowing. The best way to know for sure for your particular use case is to do a test run yourself.

## How much will scraping articles with Smart Article Extractor cost you?

When it comes to scraping, it can be challenging to estimate the resources needed to extract data as use cases may vary significantly. That's why the best course of action is to run a test scrape with a small sample of input data and limited output. You’ll get your price per scrape, which you’ll then multiply by the number of scrapes you intend to do.

Watch this video for a few helpful tips. And don't forget that choosing a higher plan will save you money in the long run.

⚠️ This can be a high-consumption actor if you don't set limits. Please make sure you set a compute unit limit in the Limit CU consumption field. ⚠️

### How do I extract articles with Smart Article Extractor?

Smart Article Extractor can be run as an Apify actor on the Apify platform where it is seamlessly integrated with a nice input UI. You can also run it locally or on any other infrastructure.

On the Apify platform:

1. Click on Try for free.
2. Enter the URL of the website(s) you want to scrape (and other input fields to narrow down the search).
3. Click on Save & Start.
4. When Smart Article Extractor has finished, preview or download your results from the Output tab.

For more detailed instructions, read our step-by-step guide on how to extract articles.

## Output example

If you run Smart Article Extractor on the Apify platform, you can get the output in many formats, like JSON, CSV, XML, Excel, RSS, and more. Here is a JSON example:

```json
{
    "url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
    "loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
    "title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
    "softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
    "date": "2020-07-07T12:13:00.000Z",
    "author": [
        "Fariha Karim"
    ],
    "publisher": null,
    "copyright": "Times Newspapers Limited 2020",
    "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
    "description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
    "lang": "en",
    "canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
    "tags": [],
    "image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
    "videos": [],
    "links": [],
    "text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room"
}
```

### Extend output function

You can use this optional function to update the default output of this actor. This function gets a JQuery handle $ as an argument, so you can choose what data from the page you want to scrape. It also receives the currentItem parameter, which is the default output parsed by the scraper so you can explore any fields. The output from this function will get merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

- Add a new field: return an object with a field that is not in the default output
- Change a field: return an existing field with a new value
- Remove a field: return an existing field with a value undefined

Let's say you want to accomplish this:

- Remove links and videos fields from the output
- Add a pageTitle field
- Change the date selector (in rare cases the scraper is not able to find it)
- Save the original date parsed so you can compare with your date

```javascript
($, currentItem) => {
    return {
        links: undefined,
        videos: undefined,
        pageTitle: $('title').text(),
        date: $('.my-date-selector').text(),
        originalDate: currentItem.date,
    }
}
```

## Integrations and Smart Article Extractor

Last but not least, Smart Article Extractor can be connected with almost any cloud service or web app thanks to integrations on the Apify platform. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and more. Or you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever Smart Article Extractor successfully finishes a run.
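
As a rough illustration of the webhook idea above, here is a minimal sketch of a receiver that reacts when a run finishes successfully. It assumes the webhook delivers Apify's default JSON payload with an `eventType` of `ACTOR.RUN.SUCCEEDED` and a `resource.defaultDatasetId` field; check the Apify webhook documentation before relying on these names. The helper names `datasetIdFromWebhook` and `createWebhookServer` and port 3000 are hypothetical and exist only for this example.

```javascript
const http = require('node:http');

// Returns the dataset ID of a successfully finished run, or null for any
// other event. The payload field names are assumptions (see the lead-in).
function datasetIdFromWebhook(payload) {
    if (!payload || payload.eventType !== 'ACTOR.RUN.SUCCEEDED') return null;
    const resource = payload.resource || {};
    return resource.defaultDatasetId || null;
}

// Hypothetical receiver: parses the JSON body and logs where the results are.
function createWebhookServer() {
    return http.createServer((req, res) => {
        let body = '';
        req.on('data', (chunk) => { body += chunk; });
        req.on('end', () => {
            let payload = null;
            try {
                payload = JSON.parse(body);
            } catch {
                // Malformed bodies are treated like any non-matching event.
            }
            const datasetId = datasetIdFromWebhook(payload);
            if (datasetId) {
                console.log(`Run succeeded, results are in dataset ${datasetId}`);
            }
            res.writeHead(datasetId ? 200 : 204);
            res.end();
        });
    });
}

// To try it locally: createWebhookServer().listen(3000);
```

Keeping the payload check in its own small function makes it easy to test without starting a server.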
## Using Smart Article Extractor with the Apify API

The Apify API gives you programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints that enable you to manage, schedule, and run Apify actors. The API also lets you access any datasets, monitor actor performance, fetch results, create and update versions, and more.

To access the API using Node.js, use the apify-client NPM package. To access the API using Python, use the apify-client PyPI package.

Check out the Apify API reference docs for full details or click on the API tab for code examples.

## Not your cup of tea? Build your own scraper

Smart Article Extractor doesn’t exactly do what you need? You can always build your own! We have various scraper templates in Python, JavaScript, and TypeScript to get you started. Alternatively, you can write it from scratch using our open-source library Crawlee. You can keep the scraper to yourself or make it public by adding it to Apify Store (and find users for it).

Or let us know if you need a custom scraping solution.

## Your feedback

We’re always working on improving the performance of our Actors. So if you’ve got any technical feedback for Smart Article Extractor or simply found a bug, please create an issue on the Actor’s Issues tab in Apify Console.
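
To make the "Using Smart Article Extractor with the Apify API" section above more concrete, here is a minimal sketch that starts a run over plain HTTP and waits for the dataset items. It uses the built-in `fetch` of Node.js 18+ instead of the apify-client package so that it has no dependencies. Several things are assumptions and not taken from this page: the actor ID `lukaskrivka/article-extractor-smart`, the `startUrls` input field, and the `run-sync-get-dataset-items` endpoint as the right choice for this use case (synchronous runs are time-limited, so longer crawls would likely need the asynchronous endpoints). Confirm all of them on the actor's API tab.

```javascript
const API_BASE = 'https://api.apify.com/v2';

// Builds the endpoint URL. The API addresses actors as "username~actor-name",
// so the slash form is converted first.
function buildRunUrl(actorId) {
    const id = encodeURIComponent(actorId.replace('/', '~'));
    return `${API_BASE}/acts/${id}/run-sync-get-dataset-items`;
}

// Starts a run with the given input and resolves with the extracted items.
async function extractArticles(actorId, token, input) {
    const response = await fetch(buildRunUrl(actorId), {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            Authorization: `Bearer ${token}`,
        },
        body: JSON.stringify(input),
    });
    if (!response.ok) {
        throw new Error(`Apify API returned HTTP ${response.status}`);
    }
    return response.json();
}

// Example call (actor ID and input field name are assumptions):
// extractArticles(
//     'lukaskrivka/article-extractor-smart',
//     process.env.APIFY_TOKEN,
//     { startUrls: [{ url: 'https://example.com/news' }] },
// ).then((items) => console.log(`${items.length} articles`));
```

Given the warning above about high consumption, a small test run with tight limits is a sensible first step before scripting larger crawls.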
Common Use Cases
Market Research
Gather competitive intelligence and market data
Lead Generation
Extract contact information for sales outreach
Price Monitoring
Track competitor pricing and product changes
Content Aggregation
Collect and organize content from multiple sources
Ready to Get Started?
Try Smart Article Extractor now on Apify. Free tier available with no credit card required.
Actor Information
- Developer: lukaskrivka
- Pricing: Paid
- Total Runs: 6,159,050
- Active Users: 6,520
Related Actors
Google Search
by devisty
Twitter Tweets Scraper
by gentle_cloud
Twitter Profile
by danek
Google News Scraper
by lhotanova
Apify provides a cloud platform for web scraping, data extraction, and automation. Build and run web scrapers in the cloud.