Here's the uncomfortable truth that keeps infrastructure engineers awake at night: most companies are terrible at measuring reliability.
Not uptime. Uptime is easy. Uptime is binary: your service is either responding to health checks or it isn't. The real killers are the things that don't trigger pager alerts: intermittent latency spikes, partial failures that affect only certain users, resource exhaustion that looks like normal traffic, and degradation so gradual that nobody flags it.
Meanwhile, the cost of actual downtime keeps climbing. The latest figures put the average cost at $5,600 per minute for midsize businesses—up from $5,000 just two years ago. Large enterprises can see $23,750 per minute during peak incidents. And 39.7% of organizations report that downtime costs have increased over the past year.
The SRE Report 2026 confirms what I've seen in the field: teams report more confidence in their reliability practices than ever before, yet median toil remains stubbornly high. We're getting better at the wrong things—celebrating uptime metrics while the real problems hide in plain sight.
The Four Types of Invisible Downtime
When I audit infrastructure, I categorize reliability problems into four buckets. Three of them don't show up in your uptime dashboard:
Type 1: Partial Failure (The "Everything Works... Mostly" Problems)
Your load balancer shows healthy. Your health checks pass. But 12% of requests are timing out after 30 seconds because one upstream service is struggling. Or your database connection pool is exhausted, causing intermittent failures that look like user errors in your logs.
I worked with an e-commerce company that thought they had 99.95% uptime. What they actually had: a payment processor that was failing silently for 3% of transactions, returning generic error messages that looked like user input problems. They'd been losing $40,000 monthly in abandoned carts for eight months before we spotted the pattern.
The detection problem: These failures don't trigger traditional alerts because something is technically responding. You need end-to-end transaction monitoring, not just infrastructure health checks.
Type 2: Performance Degradation (The Slow Death)
Your API still answers, but it takes 4.2 seconds instead of 200ms. Your database queries that ran in 50ms now take 800ms. Your page load time crept from 1.5 seconds to 6 seconds over the course of a year.
Amazon famously calculated that every 100ms of latency cost them 1% in sales. Google found that increasing page load time from 0.4 to 0.9 seconds reduced traffic by 20%. Those numbers are old—user expectations have only gotten more demanding.
Here's what I see constantly: teams optimize for p50 latency (the median) while their p99 latency balloons. "Most users are fine" isn't a reliability strategy—it's a way to quietly lose your most impatient, high-value customers.
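To make the p50 versus p99 gap concrete, here is a minimal sketch using a nearest-rank percentile. The latency samples are hypothetical (97 requests at 200ms and 3 at 4.2 seconds), and the `percentile` helper is my own illustration, not a function from any monitoring tool.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 97 fast requests and 3 slow ones.
latencies_ms = [200] * 97 + [4200] * 3

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50}ms p99={p99}ms")
```

With this sample the median still reads 200ms while the p99 sits at 4,200ms, which is why a dashboard that only shows p50 can look healthy while 3 in 100 requests are painfully slow.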
Type 3: Noisy Failure (The Alert Fatigue Problem)
Your monitoring system is screaming. It's been screaming for months. Everyone on the team has learned to ignore it because "those alerts are always firing."
When everything is urgent, nothing is. An engineering team I worked with had 2,400 alerts per day across Slack, PagerDuty, and email. Their actual on-call response time? 47 minutes when the paging system worked, longer when it didn't. They'd trained themselves to ignore the very systems meant to protect them.
The real downtime happens when a critical alert gets buried in the noise and nobody responds until customers start calling.
Type 4: Dependency Blindness (What You Don't Monitor)
Your infrastructure is fine. Your code is fine. But your third-party payment processor had an outage, your CDN is serving stale cache, your identity provider is rate-limiting authentication requests, or your cloud provider's DNS is having issues.
Modern architectures have dozens of external dependencies. Most teams monitor their own systems exhaustively while treating third parties as black boxes. When those black boxes fail, you get the worst of both worlds: your systems look healthy, but your users can't do anything.
The Reliability Audit Framework
If you want to find the downtime you don't know you're having, you need to look in places your current monitoring doesn't cover. Here's the 6-step framework I use:
Step 1: Map Your Critical User Journeys
Start with what matters. Not what your infrastructure thinks matters—what your users actually do.
Time investment: 2-3 hours with stakeholders. Typical discovery: 30-50% of critical paths have monitoring gaps.
Step 2: Audit Your Alert Quality
Alert fatigue is a reliability problem. When teams can't distinguish signal from noise, they miss actual incidents.
Target ratios: 5:1 signal-to-noise for warning alerts, 10:1 for paging alerts. If you're firing more than 50 alerts per day, you have a noise problem.
Go through your last 100 alerts. Categorize each:
- Actionable: Required immediate intervention
- Informational: Useful context, no action needed
- False positive: System was healthy, alert fired anyway
- Ignored: Known issue, team consciously took no action
If more than 20% are anything except "actionable," you need alert tuning. Delete thresholds that fire constantly. Fix flaky checks. Adjust sensitivity. An alert that's always ignored should be downgraded to a warning or deleted entirely.
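The audit above can be tallied in a few lines. This is a rough sketch that assumes you have already labeled each alert by hand. The label strings and the `audit_alerts` function are hypothetical names, and the sample counts are made up for illustration.

```python
from collections import Counter

CATEGORIES = ("actionable", "informational", "false_positive", "ignored")

def audit_alerts(labels, max_non_actionable=0.20):
    """Summarize hand-labeled alerts and flag whether tuning is needed."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alerts to audit")
    share = (total - counts["actionable"]) / total
    return {
        "counts": {c: counts[c] for c in CATEGORIES},
        "non_actionable_share": share,
        "needs_tuning": share > max_non_actionable,
    }

# Hypothetical labels for the last 100 alerts.
labels = (
    ["actionable"] * 55
    + ["informational"] * 20
    + ["false_positive"] * 15
    + ["ignored"] * 10
)
report = audit_alerts(labels)
print(report)
```

In this made-up sample, 45% of alerts are non-actionable, well past the 20% threshold, so the report flags the alert set for tuning.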
Step 3: Measure End-to-End, Not Just Infrastructure
Your server can show 20% CPU while your users experience 10-second timeouts. You need synthetic monitoring that exercises real user journeys from outside your infrastructure.
Key metrics to track per critical journey:
- Success rate: Percentage of journeys that complete successfully
- Latency percentiles: p50, p90, p99, p99.9 response times
- Error breakdown: Where failures happen (frontend, API, database, third party)
- Business impact: Conversion rates, completion rates, abandonment
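Here is one way a synthetic check could record success rate and an error breakdown per journey. It is a sketch under stated assumptions: each step is a plain Python callable that raises on failure, and the stand-in steps below replace what would be real HTTP calls made from outside your infrastructure. `JourneyResult`, `run_journey`, and `summarize` are hypothetical names.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

@dataclass
class JourneyResult:
    success: bool
    latency_s: float
    failed_stage: Optional[str] = None

def run_journey(steps: Sequence[Tuple[str, Callable[[], None]]]) -> JourneyResult:
    """Run each step in order; a step signals failure by raising an exception."""
    start = time.perf_counter()
    for stage, step in steps:
        try:
            step()
        except Exception:
            return JourneyResult(False, time.perf_counter() - start, stage)
    return JourneyResult(True, time.perf_counter() - start)

def summarize(results: List[JourneyResult]) -> dict:
    """Success rate plus a count of failures by the stage where they happened."""
    if not results:
        raise ValueError("no results")
    failures = Counter(r.failed_stage for r in results if not r.success)
    success_rate = sum(r.success for r in results) / len(results)
    return {"success_rate": success_rate, "error_breakdown": dict(failures)}

# Stand-in steps (assumptions): real checks would call your actual endpoints.
def ok() -> None:
    pass

def payment_down() -> None:
    raise TimeoutError("payment provider timed out")

browse = [("frontend", ok), ("api", ok)]
checkout = [("frontend", ok), ("api", ok), ("third_party", payment_down)]

results = [run_journey(browse) for _ in range(3)] + [run_journey(checkout)]
summary = summarize(results)
print(summary)
```

Recording the failing stage is the part that pays off: a 75% success rate tells you something is wrong, and the breakdown tells you it is the third-party step rather than your own API.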
Step 4: Implement SLOs With Realistic Targets
Service Level Objectives aren't about perfection. They're about explicit trade-offs. "This system should be 99.99% available" sounds good until you realize it allows only about 52 minutes of downtime per year, requires expensive redundancy, and might not actually matter to users.
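The downtime allowance behind an availability target is simple arithmetic, sketched below. It assumes a 365-day year, and `allowed_downtime_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year

def allowed_downtime_minutes(slo_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime a given availability target permits over the period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (100 - slo_percent) / 100 * period_minutes

for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}%: {allowed_downtime_minutes(slo):.1f} minutes/year")
```

Each extra nine shrinks the allowance tenfold: 99.9% leaves roughly 526 minutes a year, while 99.99% leaves roughly 52.6.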
Good SLOs start with user happiness, not technical perfection:
- Would users notice if this degraded?
- What's the business cost of degradation?
- What's the engineering cost of preventing it?
- Is there a cheaper way to mitigate?
I worked with a team spending $200,000 annually on multi-region redundancy for a service with 50 daily active users. Their actual user impact from regional outages? Zero—users just waited and retried. We cut their infrastructure spend 60% by accepting 99.9% instead of 99.99%.
Step 5: Track Error Budget Burn
Here's the counterintuitive thing about reliability: perfect uptime is usually wrong. You want intentional unreliability—brief, controlled, understood.
Error budgets formalize this. If your SLO is 99.9% availability, you have a 0.1% error budget—about 43 minutes per month. Use it intentionally for deployments, experiments, and lower-priority maintenance. But when you burn through it unexpectedly, stop feature work and focus on reliability.
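A rough sketch of tracking burn against that budget follows. It assumes a 30-day window and that you can count "bad minutes" from your own monitoring. `error_budget_status` and its parameters are hypothetical, and the burn rate here is simply budget consumed relative to time elapsed in the window.

```python
def error_budget_status(slo_percent, window_minutes, bad_minutes, elapsed_minutes):
    """Compare error budget consumed so far with how much of the window has elapsed."""
    if elapsed_minutes <= 0 or elapsed_minutes > window_minutes:
        raise ValueError("elapsed_minutes must be within the window")
    budget_minutes = (100 - slo_percent) / 100 * window_minutes
    consumed = bad_minutes / budget_minutes
    # A burn rate above 1.0 means the budget runs out before the window ends.
    burn_rate = consumed / (elapsed_minutes / window_minutes)
    return {
        "budget_minutes": budget_minutes,
        "consumed_fraction": consumed,
        "burn_rate": burn_rate,
        "freeze_features": consumed >= 1.0,
    }

WINDOW = 30 * 24 * 60  # 30-day window in minutes

# Hypothetical: 30 bad minutes after 10 days against a 99.9% SLO.
status = error_budget_status(99.9, WINDOW, bad_minutes=30, elapsed_minutes=10 * 24 * 60)
print(status)
```

In this example about 69% of the 43.2-minute budget is gone a third of the way through the window, a burn rate of roughly 2. That is the signal to slow down before the budget is fully spent, not after.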
Most teams skip error budgets because "we can't afford downtime." What they're really saying is "we can't measure the trade-off, so we pretend perfection is free." It's not. Chasing 99.999% uptime for everything consumes engineering resources that could build actual customer value.
Step 6: Build Observability, Not Just Monitoring
Monitoring tells you when things are broken. Observability lets you understand why.
The difference matters. Monitoring is dashboards and alerts—predefined metrics for known failure modes. Observability is the ability to ask arbitrary questions about your system's behavior: "Show me all requests from user 18472 in the last hour that had latency >2s."
Three capabilities you need:
- Distributed tracing: Follow a request through every service it touches
- Structured logging: Consistent, queryable log formats with correlation IDs
- High-cardinality metrics: Break down by user, endpoint, region, not just aggregate
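As a small illustration of structured logging with correlation IDs, here is a sketch built on Python's standard `logging` and `json` modules. The field names (`correlation_id`, `user_id`, `latency_ms`), the logger name, and the in-memory stream are assumptions for the example. A real setup would ship these lines to a log store and query them there.

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with a fixed set of queryable fields."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "user_id": getattr(record, "user_id", None),
            "latency_ms": getattr(record, "latency_ms", None),
        })

stream = io.StringIO()  # stands in for a real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# One correlation ID ties together every log line for a single request.
correlation_id = str(uuid.uuid4())
logger.info("payment authorized",
            extra={"correlation_id": correlation_id, "user_id": 18472, "latency_ms": 2400})
logger.info("order confirmed",
            extra={"correlation_id": correlation_id, "user_id": 18472, "latency_ms": 150})

# An arbitrary question asked after the fact: slow events for one user.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [e for e in events if e["user_id"] == 18472 and e["latency_ms"] > 2000]
print(slow)
```

Because every line is a consistent JSON object, the "requests from user 18472 with latency over 2s" question becomes a filter rather than a grep-and-guess exercise.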
DevOps teams that prioritize these capabilities solve incidents in hours instead of days. The 2026 data shows 61% of IT professionals rank automation as a high priority—but observability investments often deliver bigger reliability returns.
The Hidden Cost Calculator
Let's put numbers on this. Here's how to estimate what invisible downtime costs your business:
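A back-of-the-envelope version of the calculation can be sketched in a few lines. The inputs are a 2% partial failure rate on $100,000 of daily revenue and 15 hours of weekly reactive work. The $200 loaded hourly rate, the 30-day month, and the four-week month are my assumptions for illustration, so swap in your own numbers. `invisible_downtime_cost` is a hypothetical helper name.

```python
def invisible_downtime_cost(
    daily_revenue,
    partial_failure_rate,
    reactive_hours_per_week,
    hourly_cost,
    days_per_month=30,    # assumption: 30-day month
    weeks_per_month=4,    # assumption: 4-week month
):
    """Rough monthly cost of partial failures plus reactive engineering time."""
    lost_revenue = daily_revenue * partial_failure_rate * days_per_month
    reactive_labor = reactive_hours_per_week * hourly_cost * weeks_per_month
    return {
        "lost_revenue": lost_revenue,
        "reactive_labor": reactive_labor,
        "total": lost_revenue + reactive_labor,
    }

estimate = invisible_downtime_cost(
    daily_revenue=100_000,
    partial_failure_rate=0.02,
    reactive_hours_per_week=15,
    hourly_cost=200,  # assumption: loaded hourly engineering cost
)
print(estimate)
```

Under those assumptions the estimate comes to $60,000 a month in lost conversions and $12,000 a month in reactive labor. It deliberately ignores harder-to-measure costs such as churn and support load, so treat the total as a floor.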
Most companies that run this calculation are shocked. That 2% partial failure rate that "doesn't really count as downtime"? At $100,000 daily revenue, that's $60,000 monthly in lost conversions. The endless stream of "quick fixes" that consume 15 engineering hours weekly? That's another $12,000 in reactive labor costs.
The companies winning on reliability aren't the ones with the best uptime numbers. They're the ones who measure the right things and invest in prevention over reaction.
The Platform Engineering Connection
Here's where platform engineering changes the game. Individual teams can't be expected to build observability expertise, tune alerts effectively, and maintain SLO discipline while shipping features. The cognitive load is too high.
High-maturity organizations—79% of them by latest counts—are moving to platform engineering models where reliability standards are baked into the platform. Golden paths include built-in observability. Deployment pipelines include automated canary analysis. Error budgets are tracked automatically.
The 36% of high-maturity organizations automating 61%+ of deployments from commit to production? They didn't get there by asking every team to figure it out themselves. They built platforms that made the right thing automatic.
If you're serious about reliability, you have two choices: hire a reliability engineer for every product team, or build platform capabilities that make individual teams reliably excellent by default. Only one of those scales.
Your Reliability Action Plan
Here's what I'd do this week if I were responsible for your infrastructure:
Days 1-2: Run the critical user journey mapping exercise. Identify your top 3 reliability blind spots.
Day 3: Audit your last 100 alerts. Delete or fix the noisy ones.
Day 4: Implement synthetic monitoring for at least one critical journey.
Day 5: Draft your first SLOs. Start realistic—you can tighten them later.
This month: Calculate your invisible downtime cost. Use that number to justify observability investments.
The companies taking reliability seriously aren't chasing perfect uptime dashboards. They're building systems that fail gracefully, recover quickly, and give engineers the tools to understand problems before users notice them.
Your infrastructure is already more reliable than you think—but only if you measure the right things.
Want help with this?
I'll audit your reliability practices and identify the hidden downtime that's costing you money. Most audits find 15-40% improvement opportunities.
Based in Detroit. Serving infrastructure globally.