So you've got a Python and Selenium scraper working on your local machine. It's beautiful—it clicks buttons, fills forms, and extracts data like a dream. Then you try to scale it. Suddenly, you're facing detection, IP bans, and infrastructure headaches that make you question your life choices. Sound familiar?
I recently came across a fascinating Reddit discussion where someone shared their experience running a scraper for over two years across 50 nodes, collecting 3.9 million records from a major job site. The conversation that followed was pure gold—real developers sharing real problems and solutions. In this article, I'm going to expand on that discussion with everything I've learned from scaling similar projects. We'll cover the hard parts: detection avoidance, infrastructure management, and keeping everything running smoothly when you're dealing with serious volume.
The Reality of Modern Anti-Scraping Defenses
Let's start with the elephant in the room: modern websites don't want you scraping them. And they're getting really good at stopping you. The Reddit poster mentioned their target site was fingerprinting navigator.webdriver—that's just the tip of the iceberg.
In 2026, anti-bot systems have evolved far beyond simple user-agent checking. They're looking at hundreds of signals: your browser's fingerprint (canvas, WebGL, fonts), your interaction patterns (mouse movements, typing speed), even the timing of your requests. Some sophisticated systems use behavioral analysis that learns what "normal" human traffic looks like and flags anything that deviates.
What's interesting about the original discussion is the observation that headless mode got detected "faster than a visible browser." This isn't surprising when you understand how detection works. Headless browsers often have subtle differences in their JavaScript environment, missing certain properties or returning different values for browser APIs. Detection systems look for these discrepancies.
The takeaway here is simple: if you're serious about scraping at scale, you need to be serious about mimicking real browsers. Not just kinda-sorta, but down to the smallest details.
Why Full Chrome Beats Headless (And How to Make It Work)
The original poster mentioned running "full Chrome on each node" rather than headless. This is a crucial insight that many beginners miss. When you run Selenium with a visible browser, you're getting the complete Chrome environment—all the WebGL, canvas, and audio contexts that detection systems check.
But here's the practical problem: running 50 instances of full Chrome isn't trivial. Each browser instance needs memory (typically 200-500MB per instance), and you need a way to actually see them or manage them without a display. That's where virtual displays come in.
For Linux nodes (which you should probably be using for this scale), Xvfb (X Virtual Framebuffer) is your friend. It creates a virtual display that Chrome can render to without needing actual monitor hardware. The setup looks something like this:
```python
from selenium import webdriver
from pyvirtualdisplay import Display

# Start a virtual X display so full (non-headless) Chrome has somewhere to render
display = Display(visible=0, size=(1920, 1080))
display.start()

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')             # often required in containers
options.add_argument('--disable-dev-shm-usage')  # avoid /dev/shm exhaustion in Docker
driver = webdriver.Chrome(options=options)
```
The visible=0 means you're running headless from a user perspective, but Chrome itself thinks it has a real display. This gives you the best of both worlds: the complete browser environment without the overhead of actual GUI rendering.
One more thing—memory management becomes critical at scale. Chrome is notorious for memory leaks in automated scenarios. You need aggressive process cleanup, restarting browsers regularly before they consume all your node's memory.
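One way to keep that cleanup systematic is to recycle the browser after a fixed page budget rather than waiting for memory pressure. Here's a minimal sketch; the `RecyclingDriver` wrapper and the page budget of 100 are my own illustrative choices, and the factory callable stands in for whatever Chrome setup you use:

```python
class RecyclingDriver:
    """Restart the browser after a fixed number of page loads to cap memory growth."""

    def __init__(self, driver_factory, max_pages=100):
        self.driver_factory = driver_factory  # callable that returns a fresh WebDriver
        self.max_pages = max_pages
        self.pages_loaded = 0
        self.driver = driver_factory()

    def get(self, url):
        # Recycle *before* loading the next page once the budget is spent
        if self.pages_loaded >= self.max_pages:
            self.restart()
        self.driver.get(url)
        self.pages_loaded += 1

    def restart(self):
        try:
            self.driver.quit()  # tears down chromedriver and its Chrome children
        except Exception:
            pass  # the browser may already be dead; a fresh one is coming anyway
        self.driver = self.driver_factory()
        self.pages_loaded = 0
```

At 50 nodes, a predictable restart cadence also makes memory usage per node much easier to plan for than waiting for the OOM killer to decide.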
Fingerprinting Evasion: Beyond navigator.webdriver
So you've got full Chrome running. Great. Now you need to make it look like it's not automated. The Reddit poster mentioned overriding navigator.webdriver via JavaScript. That's step one, but it's far from sufficient in 2026.
Modern detection looks at dozens of properties. Here's what you actually need to address:
- WebDriver property: Object.defineProperty(navigator, 'webdriver', {get: () => undefined})
- Chrome runtime: Override window.chrome and window.navigator.chrome
- Permissions API: Mock the permissions query to return realistic values
- Plugins and mimeTypes: Ensure these arrays aren't empty (headless often has empty arrays)
- Languages and platform: Make sure they match your user-agent
But here's the real kicker—timing. When you execute these overrides matters. If you run them after the page has already loaded detection scripts, you're too late. The detection scripts might have already stored the original values. You need to execute your evasion scripts as early as possible, ideally through Chrome DevTools Protocol (CDP) before any page scripts run.
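In Selenium, the CDP command for this is Page.addScriptToEvaluateOnNewDocument, exposed through execute_cdp_cmd. A minimal sketch (the particular overrides in STEALTH_JS are illustrative, not a complete stealth suite):

```python
# JavaScript injected before any page script runs, on every navigation,
# so detection code never sees the original automated values.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
window.chrome = window.chrome || { runtime: {} };
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
"""

def apply_stealth(driver):
    """Register the evasion script via CDP so it beats any detection script."""
    driver.execute_cdp_cmd(
        'Page.addScriptToEvaluateOnNewDocument',
        {'source': STEALTH_JS},
    )
```

Call apply_stealth(driver) once right after creating the driver; unlike execute_script, the CDP-registered snippet fires on every subsequent navigation without you re-injecting it.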
In my experience, the most effective approach is combining CDP commands with middleware that intercepts and modifies responses. There are tools like puppeteer-extra and its stealth plugin that handle much of this, but with Selenium, you're often building these protections yourself.
Proxy Management at 50-Node Scale
Let's talk about the infrastructure nobody wants to discuss but everyone needs: proxies. When you're running 50 nodes hitting the same site, even with perfect browser emulation, you'll get banned based on IP patterns alone.
The original discussion didn't dive deep into proxies, but in the comments, people were asking the right questions: residential vs. datacenter, rotation strategies, cost management. Here's what I've learned from similar projects.
First, you need a proxy pool that's at least 5-10 times larger than your number of nodes. Why? Because you need to rotate IPs even within the same node's session. A single IP making hundreds of requests to a job site looks suspicious, no matter how human-like the browser fingerprint.
Residential proxies are gold for difficult targets—they come from real ISPs and blend in with normal traffic. But they're expensive. At 50 nodes running continuously, you could be looking at thousands per month. Datacenter proxies are cheaper but easier to detect and block.
My recommendation? Use a hybrid approach. Route your initial, most sensitive requests (login pages, search submissions) through residential proxies. Use datacenter proxies for the bulk data extraction where patterns matter less. And implement smart rotation: change IPs based on request count, not just time. Some sites track how long an IP has been active, not just how many requests it makes.
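Count-based rotation is simple to implement. Here's a minimal sketch; the per-proxy budget of 50 requests is an assumption you'd tune per target site:

```python
import itertools
import random

class ProxyRotator:
    """Rotate through a proxy pool, retiring each IP after a request budget."""

    def __init__(self, proxies, max_requests_per_proxy=50):
        pool = list(proxies)
        random.shuffle(pool)  # avoid every node walking the pool in the same order
        self.cycle = itertools.cycle(pool)
        self.max_requests = max_requests_per_proxy
        self.current = next(self.cycle)
        self.used = 0

    def get_proxy(self):
        # Rotate on request count, not wall-clock time
        if self.used >= self.max_requests:
            self.current = next(self.cycle)
            self.used = 0
        self.used += 1
        return self.current
```

In practice you'd also want to evict proxies that start returning bans, but the core idea (budget per IP, then move on) is what keeps any single address from standing out.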
Infrastructure: Managing 50 Nodes Without Losing Your Mind
Okay, you've solved the technical detection problems. Now you need to actually run this thing across 50 machines. This is where many projects fall apart—the operational overhead becomes overwhelming.
The Reddit poster mentioned running for "2+ years," which tells me they figured out the operational side. Here's what that likely involves:
Orchestration: You need something to start, stop, and monitor your scrapers. Kubernetes is overkill for many scraping projects (though it works). Docker Swarm or even a well-designed systemd setup can work. The key is having a central manager that can deploy code updates and configuration changes to all nodes simultaneously.
Monitoring: At 50 nodes, things will fail constantly. Browsers will crash. Proxies will stop working. Nodes will lose network connectivity. You need comprehensive monitoring that alerts you not just when a node is down, but when its success rate drops below a threshold. I prefer Prometheus for metrics collection with custom exporters for scraping-specific metrics (captcha rate, ban rate, data quality).
Data pipeline: 3.9 million records don't store themselves. You need a robust pipeline that handles partial failures. If a node extracts 95% of a page then crashes, you shouldn't lose that 95%. Implement checkpointing—save progress as you go, not just at the end.
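A checkpointing layer can be as simple as two append-only files: one for records, one for completed URLs. This is a sketch of the idea, not a production store; the file-per-node layout is an assumption:

```python
import json
import os

class Checkpointer:
    """Append each record as it is extracted and remember which URLs are done,
    so a crashed node can resume without re-scraping or losing partial work."""

    def __init__(self, data_path, state_path):
        self.data_path = data_path
        self.state_path = state_path
        self.done = set()
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.done = {line.strip() for line in f}

    def is_done(self, url):
        return url in self.done

    def save(self, url, record):
        # Append-only writes survive a crash mid-run: everything written stays.
        with open(self.data_path, 'a') as f:
            f.write(json.dumps(record) + '\n')
        with open(self.state_path, 'a') as f:
            f.write(url + '\n')
        self.done.add(url)
```

On restart, the node reloads the state file and skips anything already done; at higher scale you'd swap the files for a queue or database, but the resume semantics stay the same.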
One pattern that's worked well for me: separate the extraction logic from the browser automation. Have lightweight "crawler" nodes that handle browser interaction and save raw HTML or JSON to a queue. Then have separate "parser" nodes that extract structured data from that queue. This way, if the site changes its layout, you don't lose historical raw data that you can re-parse later.
Rate Limiting and Polite Scraping
Here's something that doesn't get enough attention: being a good citizen. Even if you can avoid detection, you should still respect the target site. The Reddit poster was scraping a job site—presumably one that helps people find employment. Overwhelming their servers helps nobody.
Good rate limiting isn't just about avoiding bans—it's about sustainability. The poster's 2+ year run suggests they figured this out. Here's how to implement intelligent rate limiting:
First, observe the site's normal traffic patterns. How fast do real users click through pages? Add random delays that match those patterns. Not just fixed seconds between requests—vary it. Humans don't browse with metronome-like precision.
Second, implement backoff mechanisms. When you get a 429 (Too Many Requests) or a captcha, don't just retry immediately. Exponential backoff is your friend. And if you keep hitting blocks from a particular IP or user session, switch to a different proxy or restart the browser entirely.
Third, consider time-of-day patterns. If you're scraping a job site, traffic is probably higher during business hours. Mimic that pattern—scrape more slowly during peak hours, faster during off-hours. This not only looks more human, it reduces your impact on their infrastructure during their busiest times.
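The first two ideas (jittered delays and exponential backoff) fit in a few lines. A minimal sketch; the base delays and cap are illustrative numbers you'd calibrate against the target site:

```python
import random
import time

def human_delay(base=3.0, jitter=2.0):
    """Sleep a randomized interval so requests don't arrive with metronome timing."""
    time.sleep(base + random.uniform(0, jitter))

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff for 429s/captchas: ~5s, 10s, 20s... capped at 5 minutes.
    The random factor prevents many nodes from retrying in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)
```

The jitter on the backoff matters more than it looks: with 50 nodes, deterministic retry timing turns one block into a synchronized thundering herd.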
Error Handling and Data Quality
When you're collecting 3.9 million records, you're going to have errors. Pages will load differently. Selectors will break. The site will change. The difference between a hobby project and production scraping is how you handle these inevitable failures.
The first rule: never fail silently. Every exception should be logged with enough context to debug it later. That means capturing the URL, the HTML (or at least a snippet), the browser state, and what you were trying to do.
Second, implement retry logic with intelligence. Some errors are transient (network issues, temporary blocks). Some are permanent (page removed, selector broken). Your system should distinguish between them. Transient errors get retried (with backoff). Permanent errors get logged for manual investigation and possibly trigger an alert that the site might have changed.
Third, validate your data as you go. Don't wait until you have millions of records to discover that half of them are missing critical fields. Implement schema validation on extraction. If a job listing doesn't have a title or company, that's probably an extraction error, not actual data.
One technique I've found invaluable: regular sampling and manual review. Pick random records from your output and manually check them against the live site. Are you capturing everything correctly? This catches drift before it becomes a data quality disaster.
Common Pitfalls and How to Avoid Them
Let's wrap up with the mistakes I see most often—the ones that derail scaling efforts.
Pitfall #1: Underestimating maintenance. A scraper isn't a "set it and forget it" system. Sites change. Anti-bot systems evolve. Budget at least 20% of your time for maintenance and updates.
Pitfall #2: Hardcoding selectors. Use relative, robust selectors. Prefer CSS classes over XPath when possible. And have fallback selectors for critical data. If the primary selector for job title fails, try two or three alternatives before giving up.
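The fallback chain is a few lines of code. A sketch, with a hypothetical selector list for a job title; the 'css selector' locator string is Selenium's standard By.CSS_SELECTOR value:

```python
def find_with_fallbacks(driver, selectors):
    """Try each CSS selector in order; return the first element found, else None."""
    for css in selectors:
        try:
            return driver.find_element('css selector', css)
        except Exception:
            continue  # selector missing on this page layout; try the next one
    return None

# Hypothetical fallback chain, most specific first
TITLE_SELECTORS = ['h1.job-title', '[data-testid="job-title"]', 'h1']
```

When a fallback beyond the first one fires, log it: that's your early warning that the site's markup is drifting.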
Pitfall #3: No circuit breakers. If a site goes down or starts returning errors, your scraper shouldn't hammer it endlessly. Implement circuit breakers that pause scraping after too many failures.
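A basic circuit breaker needs only a failure counter and a cooldown clock. A minimal sketch; the threshold of 5 failures and 10-minute cooldown are illustrative defaults:

```python
import time

class CircuitBreaker:
    """Stop hammering a site after consecutive failures; reopen after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown=600, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: half-open, let one attempt through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Wrap each request in allow()/record_success()/record_failure(), and a site outage costs you one cooldown period instead of thousands of wasted (and suspicious-looking) retries.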
Pitfall #4: Ignoring legal and ethical considerations. Check robots.txt. Review terms of service. Consider whether your scraping might harm the service. And for job sites specifically, be extra careful—people's livelihoods are involved.
Pitfall #5: DIY when you shouldn't. Sometimes, building everything yourself isn't the right choice. Platforms like Apify handle much of the infrastructure complexity, letting you focus on the extraction logic. For one-off projects or when you need to scale quickly, they can save months of development time.
Wrapping Up: Is It Worth It?
Scaling Python and Selenium to 50 nodes and millions of records is a serious undertaking. It's not just writing a script—it's building a distributed system with all the complexity that entails.
But when you need the data and there's no API, it's often the only way. The key is approaching it systematically: solve detection first with proper browser emulation, build robust infrastructure second, and always keep data quality and sustainability in mind.
The Reddit discussion that inspired this article showed something important—developers are still figuring this out, sharing knowledge, and pushing what's possible. In 2026, the cat-and-mouse game continues, but with the right approach, you can collect the data you need reliably and responsibly.
If you're just starting out, begin small. Get a single node working perfectly before you think about scaling. And if you hit a wall with infrastructure, remember that sometimes hiring an expert on Fiverr for the tricky parts can get you moving faster than struggling alone.
Happy scraping—and may your selectors always be stable.