Designing Dependable Distributed Systems
I used to keep two separate notebooks — one for first principles and one for "the rules," which were things like "clone your servers," "cache aggressively," "queue slow work." They stayed separate because I thought they lived at different levels of abstraction. Over time I've realized they're the same document, just written from different directions. This is that document, written in my own voice, because I learn better from writing than from reading.
A lot of this is textbook-adjacent. I'm not pretending to invent any of it. But the patterns land differently after you've been paged at 2am because of one of them.
What "dependable" actually means when you're on-call
In grad school, dependability was a property you proved mathematically. In production, it's the thing your phone rings about at night. Over time I've landed on four promises I make to the systems I build, and I use them as a sanity check when someone proposes adding something new.
Reliability means the system keeps producing correct results even when a disk dies, a process panics, or a dependency silently starts returning garbage. This sounds obvious and yet I've worked on systems where a single misbehaving dependency could corrupt state for hours before anyone noticed. The question isn't "does it work when everything works?" — that's easy. The question is "does it work when things are breaking in ways we didn't anticipate?"
Availability means people can use the product when they need it. Not 99.99% of the time in theory — in practice, when the engineering manager is giving a demo, when the customer's end-of-quarter deadline hits, when the traffic spike you didn't predict arrives. I've seen systems with impressive uptime numbers that were reliably unavailable at exactly the wrong moments because their failure modes weren't evenly distributed.
Scalability means capacity grows by adding copies, not by rewriting half the stack. I spent an embarrassing amount of time early in my career scaling systems vertically — bigger boxes, more RAM, faster disks — and then hitting ceilings that required architectural changes to cross. The best systems I've worked on were designed to scale horizontally from the start. Not because that's theoretically purer, but because adding a node is much less stressful than emergency vertical scaling during an incident.
Maintainability is the one I've come to care about most, because it's the one that keeps getting undervalued. An on-call engineer who's never seen this service should be able to understand what it's doing and change it without fear. If they can't — if the system requires tribal knowledge to operate safely — you've built technical debt that compounds silently until someone quits or someone makes a mistake.
I design with Service Level Objectives defined up front. If a component doesn't advance at least one of those four promises, I question why it's there. This has helped me say no to some genuinely clever ideas that would have made the system more interesting and less reliable.
Layered fault tolerance — the armor metaphor that actually stuck
A senior engineer I worked with years ago described fault tolerance as layered armor, and that framing stuck. You're not trying to prevent all failures. You're trying to ensure that failures at one layer don't cascade through all the layers at once.
At the infrastructure layer: multi-zone deployments, autoscaling groups, and immutable server images. Sessions and uploads never live on a single node — if they do, losing that node means losing user state, and I've seen that happen in ways that were very hard to explain to users. Redis or object storage carry that state. Nodes are cattle, not pets. I've worked on a legacy system where every server had a name and a personality and a specific configuration that wasn't documented anywhere. Migrating off that was a months-long project that should never have been necessary.
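To make the "cattle, not pets" point concrete, here's a minimal sketch of keeping session state in a shared Redis rather than on the node, using the redis-py client; the hostname, TTL, and key scheme are illustrative, not anything from a real system.

```python
import json
import redis  # assumes the redis-py client is installed

# Shared session store: any node can serve any request, so losing a
# node loses no user state. Hostname and TTL are placeholders.
store = redis.Redis(host="sessions.internal", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 30 * 60

def save_session(session_id: str, data: dict) -> None:
    # setex writes the value and its expiry in one call.
    store.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```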
At the platform layer: load balancers, service meshes, and deployment pipelines that make rollouts reversible. Blue/green deployments saved me from a bad production push more times than I care to count. Canary deployments gave me the confidence to ship to production without losing sleep. Feature flags let me decouple deployment from release, which sounds like a minor thing until you've shipped a feature to all users at midnight that turned out to have a significant bug.
The application layer is where I see the most variation in practice. Idempotent handlers are non-negotiable for anything that writes to a database or calls an external API — if a request gets retried, the result should be the same. Retries with jitter prevent thundering-herd problems when a dependency comes back up after a blip. Request hedging on latency-sensitive paths can dramatically reduce tail latency, though it increases cost and requires careful implementation. Graceful degradation — showing stale data rather than an error page, returning partial results rather than timing out — is the difference between a degraded experience and a broken one.
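As a concrete example of the retry pattern, here's a minimal sketch of exponential backoff with full jitter. `TransientError` and the timing parameters are placeholders, and it assumes the wrapped call is idempotent, so a retry can't double-apply a write.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever your client raises on a retryable failure."""

def retry_with_jitter(fn, attempts=5, base=0.1, cap=5.0):
    """Exponential backoff with full jitter: each retry waits a random
    amount up to a capped, exponentially growing ceiling, so a crowd of
    retrying clients doesn't hit a recovering dependency in lockstep."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the failure
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
```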
And at the human layer, which textbooks usually omit: peer-reviewed runbooks, dry-run migrations, and chaos drills that keep operator mistakes reversible. The socio-technical system breaks in ways that purely technical systems don't. I've seen incidents caused by documentation that was subtly wrong, by runbooks that hadn't been tested since a major infrastructure migration, by engineers who were too nervous to escalate because the culture didn't make escalation safe. These aren't engineering problems in the traditional sense, but they're reliability problems.
The scaling patterns I keep returning to
Stateless application servers behind smart load balancers. This is the foundation of everything else. If your application server holds state — in memory, on local disk — you can't clone it, you can't replace it gracefully, and you can't drain traffic from it without user impact. I've had to architect my way out of stateful server designs more than once, and it's never fun. Build stateless from the start.
Leader-follower database setups with read replicas are the first thing I reach for when database load becomes a problem, which it always does eventually. It's not a perfect solution — replication lag can cause consistency issues that you need to handle explicitly — but it's solved-enough that the tradeoffs are well-understood. Sharding comes later, when replicas aren't enough, and it introduces real complexity. I've sharded databases and I'd rather not do it again if I can avoid it by right-sizing early.
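A minimal sketch of the read/write split that usually accompanies this setup, assuming `primary` and `replicas` are connection-like objects exposing an `execute` method; the `require_fresh` flag is the escape hatch for read-your-writes paths where replication lag would be visible to the user.

```python
import random

class RoutingSession:
    """Send writes to the primary and reads to a replica, with an
    explicit way to force fresh reads onto the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary      # e.g. a DB connection or engine
        self.replicas = replicas    # list of replica connections

    def execute_write(self, query, params=()):
        return self.primary.execute(query, params)

    def execute_read(self, query, params=(), require_fresh=False):
        # Reads that must see the caller's own write go to the primary.
        conn = self.primary if require_fresh else random.choice(self.replicas)
        return conn.execute(query, params)
```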
Caching is where I've made the most mistakes. The obvious mistake is not caching enough. The less obvious mistake is caching incorrectly — stale reads that cause silent correctness issues, cache invalidation logic that's more complex than the thing it's caching, node-local caches that fight autoscaling because they can't be shared. I've landed on explicit invalidation, Redis for shared caches, and a policy of treating cache state as ephemeral rather than authoritative. If the cache dies, the system should degrade gracefully to hitting the database, not fail entirely.
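Here's roughly what that looks like as a cache-aside sketch with redis-py; the hostname, key scheme, and the `load_from_db` / `write_to_db` callables are placeholders. The try/except blocks are the "ephemeral, not authoritative" part: a cache outage degrades to database reads instead of an error.

```python
import json
import redis

cache = redis.Redis(host="cache.internal", decode_responses=True)

def get_user(user_id, load_from_db, ttl=300):
    """Cache-aside read: try the shared cache, fall back to the database."""
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass  # cache outage degrades to a DB read, never an error page

    user = load_from_db(user_id)
    try:
        cache.setex(key, ttl, json.dumps(user))
    except redis.RedisError:
        pass
    return user

def update_user(user_id, fields, write_to_db):
    """Write path invalidates explicitly rather than patching the cached copy."""
    write_to_db(user_id, fields)
    try:
        cache.delete(f"user:{user_id}")
    except redis.RedisError:
        pass  # a stale entry will age out via TTL
```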
Asynchronous pipelines for anything that takes more than a few hundred milliseconds. One of the best decisions I made on a data processing project was moving document processing off the request path and onto a Kafka consumer. The synchronous version worked fine at low volume and became completely unworkable as traffic grew. The async version was easier to scale, easier to monitor, and easier to replay when bugs in the processing logic required reprocessing historical documents — which happened twice. Kafka, SQS, Redis Streams — the specific technology matters less than the pattern.
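A stripped-down version of that shape, sketched with the kafka-python client; the broker address, topic name, consumer group, and `process_document` hook are all illustrative.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

# Request path: enqueue the document and return immediately.
producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_upload(document_id: str) -> None:
    producer.send("documents.to-process", {"document_id": document_id})

# Worker process: consumes at its own pace, scales by adding consumers
# to the same group, and can replay history by resetting offsets.
def run_worker(process_document) -> None:
    consumer = KafkaConsumer(
        "documents.to-process",
        bootstrap_servers="kafka.internal:9092",
        group_id="document-processors",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        process_document(message.value["document_id"])
```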
Rate limiting and backpressure at the edge, before requests reach services that can't handle load spikes gracefully. This sounds like an ops concern but it's really an architecture decision. Services that expect to be protected by upstream rate limiting are fragile; services that can absorb load spikes through their own backpressure mechanisms are much more robust in practice.
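For illustration, a minimal in-process token bucket; the rate and burst numbers are placeholders, and a real deployment would usually enforce this at the edge (API gateway or a shared store) rather than per process.

```python
import threading
import time

class TokenBucket:
    """Simple token bucket. Requests that can't get a token are rejected
    immediately (fail fast) instead of piling up behind a struggling service."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage sketch: limiter = TokenBucket(100, 20); reject with a 429 when
# limiter.allow() returns False.
```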
Distributed patterns that show up everywhere
Service discovery sounds boring until you've debugged a production issue caused by a hardcoded IP address in a config file that nobody remembered existed. I've seen this. Consul, etcd, AWS AppConfig — the specific tool matters less than having a single source of truth for service locations and configuration, and a way to roll configuration changes safely without redeploying services.
Leader election comes up more than you'd expect. Scheduled jobs that should only run on one instance. Metadata management that requires a single authoritative writer. Cache warming processes that shouldn't race. I've used Raft-based distributed lock services for this and I've also implemented simpler database-backed election mechanisms for lower-stakes cases. The database-backed version is easier to reason about and sufficient for most use cases, though it can become a bottleneck if the election happens frequently.
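A sketch of the database-backed version, assuming a pre-seeded `leases` table with (name, holder, expires_at) columns, a DB-API connection, and sqlite-style placeholders; the lease length and job name are illustrative.

```python
import time
import uuid

NODE_ID = str(uuid.uuid4())   # stable per-process identity
LEASE_SECONDS = 30

def try_acquire_lease(conn, name):
    """Single-row compare-and-set: whoever updates the row holds the
    lease until it expires or they renew it."""
    now = time.time()
    cur = conn.cursor()
    cur.execute(
        "UPDATE leases SET holder = ?, expires_at = ? "
        "WHERE name = ? AND (holder = ? OR expires_at < ?)",
        (NODE_ID, now + LEASE_SECONDS, name, NODE_ID, now),
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one node wins the row

def run_if_leader(conn, job):
    if try_acquire_lease(conn, "nightly-report"):
        job()  # only the current leaseholder runs the scheduled job
```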
Sagas for distributed transactions — this is one I came to late. For years I tried to make distributed transactions work with two-phase commit and it was always more complex and less reliable than I hoped. Saga orchestration, where each service step has a corresponding compensating transaction for rollback, is more code upfront but dramatically easier to operate and debug. The mental model shift is from "make it atomic" to "make it reversible," and that shift has real implications for how you design the happy path.
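A bare-bones orchestration sketch to show the "make it reversible" shape; the step functions in the commented example are hypothetical.

```python
def run_saga(steps):
    """Each step is (action, compensation). If a later step fails, the
    compensations for the steps that already succeeded run in reverse
    order: reversible rather than atomic."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()  # best-effort rollback; log failures, don't hide them
        raise

# Illustrative order flow (placeholder functions):
# run_saga([
#     (reserve_inventory, release_inventory),
#     (charge_payment, refund_payment),
#     (create_shipment, cancel_shipment),
# ])
```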
Circuit breakers I now add almost reflexively to any external service call. The pattern is well-understood: count failures, open the circuit when failures exceed a threshold, try a test request after a cooldown period, close the circuit if it succeeds. The harder question is what to do when the circuit is open — fail fast, return stale data, try a fallback? That decision is domain-specific and worth making explicitly rather than discovering implicitly during an incident.
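A minimal, single-threaded sketch of that state machine; the thresholds are placeholders, and the `fallback` argument is where the open-circuit decision (fail fast, stale data, alternate path) gets made explicit.

```python
import time

class CircuitBreaker:
    """Counts consecutive failures; opens after a threshold; lets one
    test call through after a cooldown; closes again on success."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()  # open: don't touch the dependency
            # Cooldown elapsed: allow a single test request (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```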
What cloud-native infrastructure taught me
Running workloads on Kubernetes taught me that the gap between "it works in a container" and "it works reliably in a cluster" is significant and largely invisible until it isn't. Pod eviction tests, latency injection, dependency black-holing — these aren't exotic chaos engineering; they're things that actually happen in production clusters and that you want to have seen before they happen to real traffic.
Observability became genuinely non-negotiable for me after spending two hours debugging a latency spike in a distributed system with no distributed tracing. Structured logs, metrics, and traces should ship alongside every significant PR, not as a follow-up task after the feature is "done." Correlation IDs that follow a request across service boundaries are the difference between "I can see the problem" and "I know there's a problem somewhere in these five services." The second experience is deeply unpleasant.
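One lightweight way to thread a correlation ID through logs in Python, using `contextvars` and a logging filter; the header name and log format are conventions I'm assuming, not standards.

```python
import contextvars
import logging
import uuid

# One correlation ID per request, carried implicitly across the call stack.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logger = logging.getLogger("service")
logger.addFilter(CorrelationFilter())
logging.basicConfig(format='{"corr":"%(correlation_id)s","msg":"%(message)s"}')

def handle_request(headers: dict) -> None:
    # Reuse the caller's ID if present so traces line up across services.
    correlation_id.set(headers.get("X-Correlation-ID", str(uuid.uuid4())))
    logger.warning("processing request")  # every log line now carries the ID
```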
Cost as a reliability signal is something I didn't appreciate early enough. A spike in spend often points to runaway fan-out or unbounded retries long before those problems surface as user-visible failures. I've started keeping finance dashboards next to reliability dashboards and treating unusual cost behavior as worth investigating even when there's no corresponding incident.
How I keep systems dependable over time
Quarterly SLO reviews with product partners. This sounds administrative but it's actually important — SLO targets drift relative to business reality, and a target that was meaningful a year ago might be too lenient or too strict now. The conversation itself is valuable because it forces alignment on what "good enough" actually means.
Automated drift detection. Infrastructure configuration drifts away from what's declared in code. Terraform plan bots that run on every PR catch this automatically rather than requiring a human to notice that something changed. Policy-as-code for security and network configuration. If it's not reviewable as code, it will eventually be wrong.
Load tests and chaos drills before major launches, not after. Every time I've skipped this "just this once" because of schedule pressure, I've regretted it. The time required to write a meaningful load test is small compared to the time required to debug a production incident at scale.
Runbooks, architecture decision records, and dashboards stored with the service, not in a separate wiki that gets out of date. The service should be self-documenting enough that someone unfamiliar with it can understand what it does, why it was built the way it was, and how to debug the most common failure modes. This is aspirational — most services fall short of it — but it's worth aiming for.
Dependable distributed systems aren't the product of a single clever design decision. They're the product of layering patterns carefully, keeping feedback loops short, and being honest about what you don't know when you're designing something new. I update this page when production teaches me something I hadn't anticipated — which still happens regularly enough to keep me humble.