30+ Node.js Microservices: The Mistakes That Cost Me Weekends (And Money)
Let me tell you something you won't hear in most tutorials: building microservices in Node.js is easy until it isn't. The first few services? They feel like victories. The next twenty? That's when the real education begins.
I've spent six years building production Node.js services for multi-tenant SaaS platforms—the kind that handle real traffic, real users, and real money. Along the way, I've made mistakes that cost me weekends, cost companies money, and taught me lessons you can't find in documentation. Some of these errors are subtle. Others hit you like a truck at 3 AM.
This isn't another theoretical guide. This is what happens when you've deployed 30+ services and learned the hard way what actually matters in production. We're talking about the stuff that breaks when you scale, the assumptions that fail under load, and the practices that separate functional services from resilient ones.
The Graceful Shutdown Trap: Why Day 1 Matters
Here's a scenario that's probably familiar: your Node process gets a SIGTERM from Kubernetes or Docker. Maybe you're scaling down, maybe there's a deployment. If you're not handling it properly—and I mean properly—you're dropping in-flight requests like they're hot potatoes.
I learned this lesson the expensive way. We had a payment service that would occasionally lose transactions during deployments. Not many—just enough to be statistically significant and absolutely terrifying. The problem? No graceful shutdown handler. When Kubernetes sent the termination signal, the process just... died. Any requests in progress? Gone. Any database transactions mid-commit? Tough luck.
Graceful shutdown isn't a "nice to have" feature you add later. It's a day-one requirement. Every single service needs it from the moment it touches production. The implementation isn't complicated, but the mindset shift is crucial. You need to think about your service's lifecycle from the very beginning.
What Actually Happens During Shutdown
When your container orchestrator decides to terminate your pod, it sends a SIGTERM signal. You get a brief window (typically 30 seconds in Kubernetes) to clean up before SIGKILL arrives. During that window, you need to:
- Stop accepting new connections
- Finish processing existing requests
- Close database connections cleanly
- Release any resources (file handles, locks, etc.)
- Log the shutdown for observability
Miss any of these, and you're creating potential data corruption, user errors, or resource leaks. The worst part? These issues often only surface under specific conditions—like high traffic during deployments—making them incredibly difficult to debug.
The Monitoring Mirage: When "Working" Isn't Enough
Here's another painful truth: most monitoring setups are theater. They tell you when things are completely broken, but they're silent about the slow degradation that actually kills user experience.
Early in my microservices journey, I'd set up basic health checks and call it a day. The service returns 200? Great, it's healthy. But that's like saying a car is "working" because the engine starts—never mind that it's leaking oil and the brakes are fading.
Real monitoring—the kind that actually helps—needs to track business metrics, not just technical ones. Response time percentiles (p95, p99), error rates by endpoint, database connection pool utilization, queue depths if you're using message brokers. These are the metrics that tell you when things are about to break, not when they already have.
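To see why percentiles beat averages, here's a toy latency tracker in plain JavaScript. In production you'd use something like `prom-client` histograms; the window size and the simulated numbers here are made up for illustration:

```javascript
// Keep a sliding window of response latencies and report percentiles --
// the numbers that degrade long before your health check returns 500.
const samples = [];

function recordLatency(ms) {
  samples.push(ms);
  if (samples.length > 10_000) samples.shift(); // crude sliding window
}

function percentile(p) {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// Simulate 100 requests: 95 fast ones, 5 pathological outliers.
for (let i = 1; i <= 100; i++) recordLatency(i <= 95 ? 20 : 900);

// The average (64ms) looks fine; p99 tells you 1 in 100 users waits 900ms.
console.log(`p95=${percentile(95)}ms p99=${percentile(99)}ms`);
```

Five slow requests out of a hundred barely move the mean, but p99 exposes them immediately, which is exactly the "about to break" signal you want.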
And logging? Don't get me started. Structured logging isn't optional anymore. If you're still using console.log with string concatenation in 2026, you're making your future self miserable. Every log entry needs context: request IDs, user IDs, timestamps with time zones, service names. Without this, debugging distributed systems becomes a nightmare of correlation.
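To show what "context on every line" means in practice, here's a dependency-free sketch of a structured logger. In a real service you'd reach for pino or winston; the service name and IDs below are invented:

```javascript
// Every log entry is one JSON line carrying shared context,
// so you can filter and correlate across services later.
const SERVICE = 'payments'; // hypothetical service name

function createLogger(context = {}) {
  const write = (level, msg, fields = {}) => {
    console.log(JSON.stringify({
      ts: new Date().toISOString(), // ISO 8601, explicit UTC offset
      level,
      service: SERVICE,
      ...context, // e.g. requestId, userId bound once per request
      ...fields,
      msg,
    }));
  };
  return {
    info: (msg, fields) => write('info', msg, fields),
    error: (msg, fields) => write('error', msg, fields),
    // child() binds extra context once instead of repeating it per call
    child: (extra) => createLogger({ ...context, ...extra }),
  };
}

// Bind request-scoped context once; every line after that carries it.
const log = createLogger().child({ requestId: 'req-123', userId: 'u-42' });
log.info('charge created', { amountCents: 1999 });
```

The child-logger pattern is the important part: bind the request ID once at the edge, and every log line downstream correlates for free.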
Configuration Management: The Silent Killer
How do you manage configuration across 30 services? If your answer involves environment variables in Dockerfiles or—heaven forbid—hardcoded values, I've got bad news for you.
Configuration drift is one of those problems that creeps up on you. One service uses a different Redis timeout. Another has a different database connection pool size. A third uses slightly different retry logic for external API calls. Individually, these differences seem harmless. Collectively, they create unpredictable behavior that's impossible to reason about.
The solution isn't necessarily a complex configuration management system (though those exist). It's consistency. Pick patterns and stick to them across all services:
- Use the same library for configuration loading (dotenv, convict, etc.)
- Validate configuration at startup, not at runtime
- Use the same naming conventions for environment variables
- Document configuration options in the same place (README or dedicated config docs)
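Here's what "validate at startup, not at runtime" can look like with no library at all. The `requireEnv` helper and the variable names are illustrative, and the fallbacks exist only so the sketch runs anywhere:

```javascript
// Fail fast: a bad or missing setting should kill the process at boot,
// not surface as a mystery timeout three hours into production traffic.
function requireEnv(name, { parse = String, fallback } = {}) {
  const raw = process.env[name] ?? fallback;
  if (raw === undefined) {
    throw new Error(`Missing required env var: ${name}`);
  }
  const value = parse(raw);
  if (typeof value === 'number' && Number.isNaN(value)) {
    throw new Error(`Env var ${name} is not a number: "${raw}"`);
  }
  return value;
}

// One frozen config object, built once at startup.
const config = Object.freeze({
  port: requireEnv('PORT', { parse: Number, fallback: '3000' }),
  redisTimeoutMs: requireEnv('REDIS_TIMEOUT_MS', { parse: Number, fallback: '2000' }),
  // A real service would give this NO fallback -- boot should fail without it.
  databaseUrl: requireEnv('DATABASE_URL', { fallback: 'postgres://localhost:5432/dev' }),
});

console.log(JSON.stringify({ msg: 'config loaded', port: config.port }));
```

Freezing the object is a cheap guard: nothing deep in the call stack can quietly mutate a timeout at 2 AM.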
And for the love of all that's holy, don't commit secrets to source control in plaintext .env files or deployment manifests. Use a secrets manager, or at minimum encrypted secrets that get decrypted at runtime.
Error Handling: Beyond Try/Catch
Here's something I wish someone had told me earlier: error handling in microservices isn't about preventing errors—it's about managing them gracefully. Things will fail. Databases will have connection issues. External APIs will time out. Networks will partition.
The mistake I made early on was treating every error as exceptional. If the database was down, the service would crash. If an external API returned 500, we'd throw an error up the stack. This creates brittle systems that fail catastrophically instead of degrading gracefully.
What you actually want are patterns like:
- Circuit breakers for external dependencies
- Retry logic with exponential backoff
- Fallback mechanisms when primary services are unavailable
- Dead letter queues for messages that can't be processed
These patterns acknowledge that failure is normal. They let your service handle partial outages without becoming part of the problem. And they give you time to fix issues before they affect every user.
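As one concrete instance of these patterns, here's a retry helper with exponential backoff and jitter. It's a sketch, not a library: the delays are illustrative, and it assumes the wrapped operation is safe to repeat:

```javascript
// Retry an async operation with exponential backoff plus jitter.
// Only wrap idempotent calls -- never blind-retry a payment capture.
async function withRetry(fn, { retries = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of attempts, give up
      // 100ms, 200ms, 400ms... with jitter, so a fleet of clients
      // doesn't hammer a recovering dependency in lockstep.
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Demo: a hypothetical API that fails twice, then succeeds.
let calls = 0;
async function flakyApi() {
  calls += 1;
  if (calls < 3) throw new Error('ETIMEDOUT');
  return { ok: true };
}

withRetry(flakyApi, { retries: 5, baseDelayMs: 10 })
  .then(() => console.log(`succeeded after ${calls} calls`));
```

A circuit breaker is the natural next layer on top of this: once failures cross a threshold, stop calling the dependency entirely instead of retrying into a wall.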
Dependency Management: The Versioning Nightmare
Let's talk about node_modules. Specifically, let's talk about what happens when you have 30 services, each with slightly different versions of the same dependencies.
Early on, I'd update dependencies in each service independently. "This service needs the latest version of Express, so I'll update it here." Sounds reasonable, right? Until you realize you're running 12 different versions of Express across your ecosystem, each with slightly different behavior and security patches.
The solution? Lock things down. Use package-lock.json or yarn.lock religiously. Consider using a monorepo tool like Lerna or Nx to manage dependencies across services. Or at minimum, create a shared "base" Docker image with common dependencies pre-installed.
And while we're on the topic: audit your dependencies regularly. npm audit isn't perfect, but ignoring it is asking for trouble. Those "low severity" vulnerabilities add up, especially when they're in transitive dependencies three levels deep.
Testing: What Actually Matters
Here's a controversial opinion: unit tests are overrated for microservices. Don't get me wrong—they're useful. But they're not where you should spend most of your testing effort.
What matters more? Integration tests that verify your service works with its actual dependencies. Contract tests that ensure you don't break API compatibility. End-to-end tests that simulate real user flows. These are the tests that catch the bugs users actually experience.
The mistake I made was focusing too much on unit test coverage metrics. "We have 90% coverage!" Great. But if those tests mock away all the external dependencies, they're not telling you whether your service actually works in production.
My testing philosophy now: start with the integration tests. Make sure the service can actually talk to its database, message queue, and external APIs. Then add unit tests for complex business logic. Finally, add a few critical path end-to-end tests that verify the most important user journeys.
Documentation: The Thing You'll Wish You Had
Nobody likes writing documentation. I get it. But when you're trying to debug a production issue at 2 AM, and you can't remember how service A talks to service B, you'll wish you had written something down.
The key is to document just enough—not everything. Focus on:
- Architecture diagrams showing service relationships
- API documentation (OpenAPI/Swagger is your friend)
- Deployment procedures and rollback steps
- Common troubleshooting scenarios
- Ownership and escalation paths
And here's a pro tip: document decisions, not just facts. Why did you choose RabbitMQ over Kafka? Why is the timeout set to 30 seconds? Why does this service have a different authentication mechanism? This context becomes invaluable when new team members join or when you're considering changes years later.
The Human Factor: Scaling Yourself
Here's the thing nobody talks about: managing 30+ microservices is as much about scaling your own cognitive load as it is about scaling infrastructure.
Early on, I could hold the entire system in my head. I knew how every service worked, where the data flowed, what could break. Around service number 15, that started to break down. By service 25, it was impossible.
You need systems to manage the complexity:
- Standardized naming conventions (seriously, this matters more than you think)
- Centralized logging and metrics (you can't debug what you can't see)
- Service catalogs or registries
- Automated dependency graphs
- Runbooks for common operations
And sometimes, you need to acknowledge when a microservice architecture has become too fragmented. Not every piece of functionality needs its own service. Sometimes, a well-modularized monolith is easier to manage than a swarm of microservices.
Common Questions (And Real Answers)
Let me address some questions that come up constantly:
"When should I split a service?" When you have a clear bounded context that changes for different reasons than the rest of the system. Not when you think it's "too big." Size alone is a terrible metric.
"How do I handle shared code?" Carefully. Shared libraries can create tight coupling that defeats the purpose of microservices. If you must share code, keep it minimal and version it independently.
"What about database per service vs shared database?" Start with a shared database if you're small. Migrate to databases per service when you have clear boundaries and the operational overhead to manage them. Premature database splitting creates more problems than it solves.
"How do I debug distributed transactions?" You don't. You avoid them. Use eventual consistency patterns instead. Distributed transactions are the quickest path to production nightmares.
Wrapping Up: What Actually Matters
After six years and 30+ services, here's what I've learned: the technical details matter, but the principles matter more. Consistency beats cleverness every time. Observability is non-negotiable. And failure isn't something to prevent—it's something to design for.
The biggest mistake isn't any specific technical error. It's the assumption that microservices make things simpler. They don't. They trade implementation complexity for operational complexity. They give you flexibility at the cost of coordination overhead.
My advice? Start with fewer services than you think you need. Get the fundamentals right on those first services—graceful shutdown, proper monitoring, consistent configuration, thoughtful error handling. Then, and only then, consider splitting.
Because here's the truth: building microservices is easy. Building good microservices—the kind that don't wake you up at night—that's the real challenge. And it starts with learning from other people's mistakes so you don't have to make them all yourself.