Your Infrastructure Is Lying to You About Utilization

Three days ago, CAST AI published their annual State of Kubernetes Optimization Report. They analyzed tens of thousands of production clusters running on AWS, GCP, and Azure—measured directly, not estimated.

The results weren't just disappointing. They were worse than last year.

CPU utilization dropped from 10% to 8%. Memory utilization fell from 23% to 20%. And for the first time, they measured GPU utilization: a staggering 5%.

20x: Organizations provision 20 times more GPU capacity than they actively use

I've been auditing Kubernetes clusters for four years. These numbers don't surprise me, but they should embarrass our industry. We're not getting better at this. We're getting worse.

The Structural Problem Nobody Wants to Admit

Here's what changed between 2025 and 2026. CPU overprovisioning jumped from 40% to 69%. Memory overprovisioning now sits at 79%. You're paying for infrastructure your workloads don't even request.

Why? The mechanics are depressingly simple:

Step one: Developers pad resource requests to avoid throttling and OOM evictions. Understandable. Nobody wants their pod killed at 2 AM because of a memory spike.

Step two: The cost of that padding is invisible to the team requesting it. There's no meter showing "this memory request costs $847/month." Just a YAML file and a vague sense of being "safe."

Step three: There's no systematic process to revisit those requests. Set it and forget it. The 4Gi memory request you configured for a spike in March 2023? Still there. Still billing you every hour of every day.

Meanwhile, 93.15% of top-performing engineering organizations use internal developer platforms with built-in resource controls. Only 1.88% of low-performing teams do. That's not a coincidence.

69% CPU overprovisioning in 2026—up from 40% in 2025

The GPU Waste Crisis

Let's talk about the 5% GPU utilization figure for a moment because it's the most grotesque number in the report.

NVIDIA H100 GPUs cost $2-3 per hour on-demand. An A100 runs $1-2 per hour. At 5% utilization, you're paying full price for hardware that's sitting idle 95% of the time.

I audited a machine learning startup last month. They were running 40 H100s for model training. When we dug into the metrics, their actual GPU compute utilization over a 30-day period averaged 4.7%. The rest was initialization overhead, data loading bottlenecks, and waiting for the next job to start.

They were burning $57,000 per month on GPUs to get roughly $2,700 worth of actual compute.

This isn't unique to ML workloads. Any GPU-accelerated workload—video encoding, scientific simulation, rendering—shows the same pattern. Organizations buy capacity for peak theoretical demand, then run it at a fraction of capacity day-to-day.

NVIDIA H100 cost (on-demand): $2.50/hour
Average utilization: 5%
Effective cost per utilized hour: $50.00/hour
Monthly waste (per GPU at 5% utilization): $1,710
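
The arithmetic behind those four figures is easy to check yourself. Here is a minimal Python sketch, assuming a 720-hour month (30 days, billed around the clock); the function name and that constant are my own illustration, not something from the report.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, GPU billed 24 hours a day


def gpu_idle_cost(hourly_rate: float, utilization: float) -> dict:
    """Split one GPU's monthly bill into its used and idle portions."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    monthly_bill = hourly_rate * HOURS_PER_MONTH
    return {
        # What each hour of real work costs once idle time is paid for.
        "effective_cost_per_utilized_hour": hourly_rate / utilization,
        "monthly_bill": monthly_bill,
        # The share of the bill that bought idle hardware.
        "monthly_waste": monthly_bill * (1 - utilization),
    }


figures = gpu_idle_cost(hourly_rate=2.50, utilization=0.05)
for name, value in figures.items():
    print(f"{name}: ${value:,.2f}")
```

Plug in your own rate and measured utilization. At $2.50/hour and 5%, it lands on the same $50.00 per utilized hour and $1,710 of monthly waste per GPU.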

The Platform Engineering Fix

Here's the uncomfortable truth: You can't fix this with quarterly cost reviews. The gap between when waste is created and when it's discovered is too long. By the time Finance flags a high bill, you've already paid for six months of waste.

The only sustainable solution is treating infrastructure as a product. That's what platform engineering is really about.

Organizations with mature internal developer platforms don't see 8% CPU utilization. They see 40-60% because their platforms enforce guardrails by default:

Resource quotas per namespace with alerts at 80% usage
Default request templates that start small and scale based on actual usage
Automated right-sizing recommendations surfaced directly to developers
Cost attribution by workload visible in every deployment dashboard
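
To make the first two guardrails concrete, here is a rough Python sketch that builds a Kubernetes ResourceQuota and a LimitRange as plain dictionaries and prints them as JSON, which kubectl can apply. The namespace name, object names, and every number are hypothetical placeholders. The 80% alerting would live in your monitoring stack and is not covered here.

```python
import json


def namespace_guardrails(namespace: str) -> list[dict]:
    """Build a quota plus small container defaults for one namespace."""
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-quota", "namespace": namespace},
        # Hypothetical ceilings on the namespace's total requests and limits.
        "spec": {"hard": {
            "requests.cpu": "20",
            "requests.memory": "40Gi",
            "limits.cpu": "40",
            "limits.memory": "80Gi",
        }},
    }
    defaults = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "small-defaults", "namespace": namespace},
        # Applied to containers that do not declare their own values.
        "spec": {"limits": [{
            "type": "Container",
            "defaultRequest": {"cpu": "100m", "memory": "256Mi"},
            "default": {"cpu": "500m", "memory": "512Mi"},
        }]},
    }
    return [quota, defaults]


manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": namespace_guardrails("team-a"),
}
print(json.dumps(manifest, indent=2))
```

The point of the LimitRange is the "start small" default: a container that ships without requests gets a modest one instead of whatever a nervous engineer would have typed.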

When developers can see that their 4Gi memory request costs $847/month, they make different choices. When the platform suggests a 1Gi starting point with automatic scaling, they take it. The 93.15% of elite teams using IDPs aren't special—they're just systematic about something everyone else treats as an afterthought.

The 4-Week Fix Framework

You don't need to rearchitect everything. You need visibility, then action. Here's what works:

Week 1: Expose the Truth

Install a cost visibility tool. Kubecost, OpenCost, or your cloud provider's native solution. The goal isn't optimization yet—it's making the invisible visible.

Install a cost attribution tool across all clusters
Generate a cost breakdown by namespace and workload
Identify the top 10 most expensive deployments
Compare average actual usage against requested resources for each one

Expected outcome: You'll find 2-3 workloads accounting for 30-40% of your bill with sub-20% utilization.
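
Once the cost tool gives you an export, the Week 1 analysis is a sort and a filter. The sketch below assumes a hypothetical record shape (name, monthly cost, CPU requested, average CPU used). Your tool's export will look different, so treat the field names and sample numbers as placeholders.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    monthly_cost: float   # dollars, taken from the cost tool's export
    cpu_requested: float  # cores requested
    cpu_used_avg: float   # cores actually used, averaged over the window

    @property
    def utilization(self) -> float:
        if self.cpu_requested == 0:
            return 0.0
        return self.cpu_used_avg / self.cpu_requested


def top_offenders(workloads, top_n=10, max_utilization=0.20):
    """Return (name, share of bill, utilization) for costly, idle workloads."""
    total = sum(w.monthly_cost for w in workloads)
    if total == 0:
        return []
    costliest = sorted(workloads, key=lambda w: w.monthly_cost, reverse=True)
    return [
        (w.name, w.monthly_cost / total, w.utilization)
        for w in costliest[:top_n]
        if w.utilization < max_utilization
    ]


# Hypothetical sample data for illustration only.
sample = [
    Workload("ml-batch", 12000, 64, 6.4),
    Workload("api-gateway", 3000, 8, 4.0),
    Workload("search-indexer", 9000, 32, 4.8),
    Workload("cron-reports", 1000, 4, 0.2),
]
for name, share, util in top_offenders(sample, top_n=3):
    print(f"{name}: {share:.0%} of bill at {util:.0%} utilization")
```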

Week 2: Kill the Zombies

Before you optimize live workloads, eliminate the dead weight. These are pure wins with zero risk.

Delete abandoned namespaces from completed projects
Remove orphaned storage: PersistentVolumes no longer bound to a claim, and PersistentVolumeClaims that no pod mounts
Delete unused LoadBalancer services ($15-20/month each)
Clean up old container images (especially GPU base images)

Expected outcome: 10-15% immediate bill reduction with zero functional impact.
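
For the storage item, a quick way to build a review list is to filter volumes by phase. This Python sketch assumes its input is the JSON printed by `kubectl get pv -o json`. The function name and the sample data are mine, and the output is a list to review with the owning team, not a list to delete blindly, since a Released volume can still hold data.

```python
import json


def reclaimable_volumes(pv_list_json: str) -> list[str]:
    """Names of PersistentVolumes that are not bound to any claim."""
    items = json.loads(pv_list_json).get("items", [])
    return [
        pv["metadata"]["name"]
        for pv in items
        # Available: never claimed. Released: its claim was deleted.
        if pv.get("status", {}).get("phase") in {"Available", "Released"}
    ]


# Hypothetical stand-in for real kubectl output.
sample_output = json.dumps({"items": [
    {"metadata": {"name": "pv-a"}, "status": {"phase": "Bound"}},
    {"metadata": {"name": "pv-b"}, "status": {"phase": "Released"}},
    {"metadata": {"name": "pv-c"}, "status": {"phase": "Available"}},
]})
print(reclaimable_volumes(sample_output))
```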

Week 3: Right-Size the Live Workloads

Now tackle the overprovisioning. This requires actual analysis, not guesswork.

For each high-cost workload, pull 30 days of metrics. Calculate peak usage, average usage, and p95 usage. Your resource requests should target p95, not peak-plus-margin.

The formula that works: Request = p95(actual usage) × 1.2. Not peak × 2. Not "whatever feels safe." p95 × 1.2.

Set limits higher than requests—typically 2-3x—to handle real spikes. But don't over-request on the front end. Kubernetes schedules based on requests, not limits. Over-requesting equals paying for capacity you don't use.
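
Here is the p95 × 1.2 rule as a small Python sketch. It uses a nearest-rank percentile over raw usage samples. The 2x limit multiplier is just one point in the 2-3x range, and the sample data is invented to show how a rare spike ends up covered by the limit instead of inflating the request.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with pct% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend(samples_mib, headroom=1.2, limit_multiplier=2.0):
    """Request = p95 x headroom; limit = request x multiplier (both in MiB)."""
    p95 = percentile(samples_mib, 95)
    request = round(p95 * headroom)
    limit = round(request * limit_multiplier)
    return {"p95_mib": p95, "request_mib": request, "limit_mib": limit}


# Hypothetical 30-day window: mostly 600-800 MiB, with a few spikes to 1900 MiB.
usage_mib = [600] * 60 + [800] * 36 + [1900] * 4
print(recommend(usage_mib))
```

With this data the request is sized from the 800 MiB the workload needs almost all the time, while the limit still leaves room for the 1900 MiB spikes. A peak × 2 approach would have requested 3800 MiB for the same workload.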

Expected outcome: 20-30% reduction in provisioned capacity as workloads shrink to actual needs.

Week 4: Implement Spot and Savings Plans

For fault-tolerant workloads—CI/CD runners, batch jobs, stateless microservices, development environments—migrate to Spot instances. The 60-90% savings is real.

For predictable baseline capacity, commit to Reserved Instances or Savings Plans. A 1-year commitment typically saves 30-40%. Three years can hit 50-60%.

The key is segmentation. Don't treat all workloads the same. Your production database and your staging CI runner have completely different reliability requirements.

Expected outcome: Additional 40-70% savings on compute costs for qualified workloads.
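
Segmentation is easier to reason about with numbers in front of you. This Python sketch computes a blended bill from per-segment discounts. The segment names, costs, and discount rates are hypothetical, picked from inside the ranges quoted above.

```python
def blended_bill(segments):
    """segments: list of (name, on_demand_monthly_cost, discount as a fraction)."""
    before = sum(cost for _, cost, _ in segments)
    after = sum(cost * (1 - discount) for _, cost, discount in segments)
    savings_rate = (before - after) / before if before else 0.0
    return before, after, savings_rate


# Hypothetical split of a $40,000/month on-demand compute bill.
segments = [
    ("CI runners and batch jobs on Spot", 10_000, 0.70),
    ("Baseline services on a 1-year commitment", 20_000, 0.35),
    ("Production database, left on-demand", 10_000, 0.00),
]
before, after, rate = blended_bill(segments)
print(f"before: ${before:,.0f}  after: ${after:,.0f}  saved: {rate:.0%}")
```

The headline discounts only apply to the workloads that qualify, so the blended number for the whole bill will be lower than the per-segment figures. That is the reason to segment first and commit second.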

The Measurement Problem

Here's why this keeps happening: Only 6.9% of teams achieve a rework rate below 2%. Most organizations are constantly fighting fires, dealing with technical debt, and shipping features to stay competitive. Infrastructure efficiency falls to the bottom of the priority list.

Until Finance shows up asking why the cloud bill doubled.

The fix isn't hiring a Kubernetes expert to audit your cluster every quarter. The fix is making efficiency visible and automated. When developers see cost metrics in their deployment pipeline, they optimize as a side effect of normal work. When platform teams enforce reasonable defaults, waste never accumulates.

The 2026 CAST AI report should be a wake-up call. Infrastructure utilization isn't just getting worse—it's getting worse faster. The gap between elite performers and everyone else is widening.

The Real Talk

Let me be direct: Your cluster is not special. Your workloads are not uniquely demanding. If you're seeing 8% CPU and 20% memory utilization, you're not being conservative—you're being wasteful.

Every percentage point of utilization you gain is money back in your budget. Every GiB of right-sized memory is funding you can redirect toward actual innovation.

The companies winning right now aren't running bigger clusters. They're running smarter ones. They've internalized what the CAST AI data proves: efficiency is a competitive advantage, and waste is a choice.

You can choose differently. Start this week.

Want help auditing your Kubernetes utilization?
I'll identify your 10 biggest optimization opportunities in 48 hours. Typical first audits find 30-50% waste.

clide@butler.solutions

Based in Detroit. Serving infrastructure globally.