I want to share a number that kept me up at night recently: 54% of enterprises now run more than 100 Kubernetes clusters. Some of those clusters host thousands of workloads, all competing for the same CPU and memory. That's a level of operational complexity no manual process can handle.
And yet, according to a survey of 321 Kubernetes practitioners at organizations with over 1,000 employees, published this week by CloudBolt Software, only 17% can continuously optimize the infrastructure their clusters run on, even though 89% say automation is crucial.
The gap isn't philosophical. It's practical. And expensive.
💡 The hard truth: 69% of teams report that manual optimization breaks down before roughly 250 changes per day. If you're running 100+ clusters, you're past that threshold already.
Why Teams Know Better But Don't Do Better
Here's where it gets interesting. The survey isn't telling us that IT leaders are ignorant about automation. On the contrary: 59% have already achieved automatic deployment to production. They understand the value of removing humans from repetitive tasks.
But when it comes to the next phase, actually optimizing those resources automatically, 71% require human review before applying any type of resource optimization.
Think about that. You're paying engineers to manually review every CPU rightsizing decision. Every memory adjustment. Every storage allocation. At scale, that's not cost-conscious management. That's a bottleneck dressed up as caution.
So what's holding them back? The survey asked that question too.
48% said visibility and transparency would most increase their trust in automation. They don't know what the system is doing or why, so they default to manual review.
25% wanted proven guardrails before they'd let automation make changes.
23% needed instant rollback capabilities to feel comfortable.
The theme is clear: Teams aren't afraid of automation itself. They're afraid of automation without safety nets.
The Real Cost of Doing Nothing
Let's talk numbers. The global cloud computing market hit $912.77 billion in 2025, up from $156.4 billion just five years earlier. Cloud infrastructure is no longer a side expense; for many companies, it's one of the largest line items in their budget.
And how much of that spend is wasted? Studies consistently show that enterprises waste 30-40% of their cloud spending through over-provisioning, idle resources, and underutilized commitments. Even with FinOps adoption, recent data shows 29% of cloud budgets remain wasted, primarily from underutilized Reserved Instances and Savings Plans that never get adjusted as workloads evolve.
Here's a simple math problem: If your cloud bill is $100,000/month and you're at the conservative end of waste (30%), that's $30,000 monthly going to resources you don't need. If your team's bottleneck is manual review of optimization recommendations, you're trading $30,000/month in waste against the time of engineers who cost you $15,000+ per month each.
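If it helps to see that trade-off as code, here's a minimal sketch of the arithmetic; the bill, waste rate, and engineer cost are the example figures above, while the reviewer count and review-time fraction are hypothetical inputs you'd replace with your own:

```python
# Back-of-the-envelope version of the waste-vs.-review trade-off described
# above. Reviewer count and time fraction are assumptions, not survey data.

monthly_cloud_bill = 100_000       # USD
waste_rate = 0.30                  # conservative end of the 30-40% range
engineer_cost = 15_000             # fully loaded USD per engineer per month

reviewers = 2                      # assumption: engineers doing manual reviews
review_time_fraction = 0.5         # assumption: share of their time it eats

monthly_waste = monthly_cloud_bill * waste_rate
review_cost = reviewers * engineer_cost * review_time_fraction

print(f"Monthly waste:                 ${monthly_waste:,.0f}")
print(f"Monthly manual-review cost:    ${review_cost:,.0f}")
print(f"Total cost of 'being careful': ${monthly_waste + review_cost:,.0f}")
```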
The economics don't work. And they don't scale.
🎯 Key insight: The cost of "being careful" with manual reviews has quietly become more expensive than the occasional over-provisioning that automation might (briefly) cause.
The 90-Day Automation Roadmap
You don't need to flip a switch and go full autopilot on day one. The teams that succeed at continuous optimization take a phased approach. Here's the framework I use with clients to close the 89%-to-17% gap in 90 days:
Phase 1: Build Visibility First
Remember: 48% of practitioners said visibility would increase their trust. Before you automate any changes, automate the knowledge of what's there.
Implement comprehensive tagging. Cost allocation tags, environment tags, owner tags, workload-type tags. If your clusters are shared infrastructure across teams, you need to know who is creating what and when. Untagged resources should trigger alerts, not just policy violations.
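As a sketch of what that alerting could look like, the snippet below scans namespaces for missing labels using the official kubernetes Python client; the required label keys are an illustrative policy, and in production the print would be a pager or Slack call:

```python
# Minimal untagged-resource scan: report namespaces missing required labels.
# Assumes the official `kubernetes` Python client and a reachable kubeconfig.
from kubernetes import client, config

REQUIRED_LABELS = {"team", "environment", "cost-center"}  # hypothetical policy

def find_untagged_namespaces() -> dict[str, list[str]]:
    config.load_kube_config()  # use load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    violations = {}
    for ns in v1.list_namespace().items:
        labels = ns.metadata.labels or {}
        missing = REQUIRED_LABELS - labels.keys()
        if missing:
            violations[ns.metadata.name] = sorted(missing)
    return violations

if __name__ == "__main__":
    for name, missing in find_untagged_namespaces().items():
        print(f"ALERT: namespace '{name}' is missing labels: {missing}")
```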
Deploy real-time cost dashboards. Tools like Kubecost, OpenCost, or cloud-native solutions give engineers immediate feedback on what their deployments actually cost. Visibility creates accountability. Accountability creates behavior change.
Establish baseline metrics. CPU utilization patterns. Memory pressure signals. Storage growth rates. You cannot optimize what you haven't measured. Document these so you know if your later automation is actually improving things or just moving numbers around.
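One way to capture that baseline is a periodic snapshot from Prometheus, assuming a server scraping the standard cAdvisor metrics; the endpoint URL and output file here are placeholders:

```python
# Snapshot 7-day CPU and memory baselines per namespace from Prometheus so
# later automation can be compared against a recorded starting point.
import json
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # placeholder endpoint

QUERIES = {
    "cpu_cores_by_namespace":
        "sum(rate(container_cpu_usage_seconds_total[7d])) by (namespace)",
    "memory_bytes_by_namespace":
        "sum(avg_over_time(container_memory_working_set_bytes[7d])) by (namespace)",
}

def snapshot_baseline() -> dict:
    baseline = {}
    for name, promql in QUERIES.items():
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
        resp.raise_for_status()
        baseline[name] = resp.json()["data"]["result"]
    return baseline

if __name__ == "__main__":
    with open("baseline-metrics.json", "w") as f:
        json.dump(snapshot_baseline(), f, indent=2)
```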
Phase 1 Checklist
- Mandatory tagging policy with automated enforcement
- Cluster-level cost visibility for every namespace and workload
- Baseline metrics established for the top 20 workloads by cost
- Alerting rules for untagged resources and cost anomalies
- Weekly cost review meeting scheduled with stakeholders
Phase 2: Automate the Safe Decisions
Now you know what you have and what it costs. Time to automate the recommendations that carry zero risk: the ones engineers were going to approve anyway.
Start with idle resource detection. Resources with zero utilization for 7+ days get automatically tagged for review. After 14 days, they get automatically scaled to zero. This isn't aggressive. It's just enforcing what common sense already requires.
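Here's a minimal sketch of that 7-day / 14-day decision logic with the kubernetes Python client. Metric collection is elided: the caller supplies the deployment's peak CPU over the last 7 days from whatever monitoring stack you run, and the annotation key is a hypothetical convention:

```python
# Idle policy sketch: after 7 idle days, annotate for review; if still idle
# 7 days after the annotation (day 14), scale the deployment to zero.
from datetime import datetime, timedelta, timezone
from kubernetes import client

IDLE_ANNOTATION = "cost-review/idle-since"  # hypothetical annotation key
IDLE_CPU_CORES = 0.001                      # what counts as "zero utilization"

def enforce_idle_policy(apps: client.AppsV1Api, deploy, max_cpu_7d: float):
    if max_cpu_7d >= IDLE_CPU_CORES:
        return  # not idle; a fuller version would also clear stale annotations

    now = datetime.now(timezone.utc)
    annotations = deploy.metadata.annotations or {}
    idle_since = annotations.get(IDLE_ANNOTATION)

    if idle_since is None:
        # Day 7: tag for human review first, don't act yet.
        body = {"metadata": {"annotations": {IDLE_ANNOTATION: now.isoformat()}}}
        apps.patch_namespaced_deployment(
            deploy.metadata.name, deploy.metadata.namespace, body)
    elif now - datetime.fromisoformat(idle_since) >= timedelta(days=7):
        # Day 14: a week after tagging and still idle, scale to zero.
        apps.patch_namespaced_deployment_scale(
            deploy.metadata.name, deploy.metadata.namespace,
            {"spec": {"replicas": 0}})
```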
Implement workload rightsizing. Compare requested vs. actual usage. If a workload is consistently using 80%+ of its request, flag for upsizing. If it's using less than 20%, flag for downsizing. But here's the key: In Phase 2, these are still recommendations that go to engineers, not automatic changes.
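The rule is simple enough to state as a pure function. This sketch uses the thresholds from the text; the usage input is whatever your metrics pipeline reports (a p95 over a representative window is a common choice):

```python
# Phase 2 rightsizing rule: emit a recommendation, never an automatic change.

def rightsizing_recommendation(requested: float, actual_p95: float) -> str | None:
    """Return 'upsize', 'downsize', or None. Units must match (cores or bytes)."""
    if requested <= 0:
        return None  # no request set; that's a different policy problem
    utilization = actual_p95 / requested
    if utilization >= 0.80:
        return "upsize"    # consistently near the request: headroom is thin
    if utilization < 0.20:
        return "downsize"  # paying for 5x more than the workload uses
    return None

# Example: a workload requesting 2 CPU cores but using 0.3 at p95.
print(rightsizing_recommendation(requested=2.0, actual_p95=0.3))  # -> downsize
```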
Automate commitment management. Reserved Instances and Savings Plans should adjust based on actual usage patterns, not be purchased once and forgotten. If your steady-state workload changes, your commitments should change with it.
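On AWS, a minimal watchdog for this could poll Cost Explorer. This sketch assumes boto3 credentials with Cost Explorer access; the 90% alert threshold is an illustrative choice, not a rule:

```python
# Flag Savings Plans whose utilization has drifted below a chosen threshold,
# so commitments get revisited as workloads change.
from datetime import date, timedelta
import boto3

ALERT_THRESHOLD = 90.0  # percent; illustrative, tune to your tolerance

def check_savings_plans_utilization() -> float:
    ce = boto3.client("ce")
    end = date.today()
    start = end - timedelta(days=30)
    resp = ce.get_savings_plans_utilization(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()}
    )
    pct = float(resp["Total"]["Utilization"]["UtilizationPercentage"])
    if pct < ALERT_THRESHOLD:
        print(f"Savings Plans utilization at {pct:.1f}% - review commitments")
    return pct
```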
Phase 2 Checklist
- Idle resource detection with 7-day warning threshold
- Automated rightsizing recommendations for top 50 workloads
- Savings Plans/RI utilization tracking with automated purchase recommendations
- Slack/email integration for cost anomaly alerts
- First rightsizing changes implemented from Phase 2 recommendations
Phase 3: Deploy Autonomous Optimization
By now, your team has seen two months of automated recommendations. They've watched the visibility dashboards. They've acted on reliable rightsizing suggestions. Trust should be building.
Time to remove the human bottleneck from low-risk decisions.
Enable automatic rightsizing within guardrails. Define safe bounds: no more than 25% reduction in a single change, never below workload minimums, never during business hours. Then let the system act within those constraints.
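Expressed as a single gate function, those guardrails might look like the sketch below; the business-hours window and the example values are illustrative defaults:

```python
# Guardrail gate: cap any single reduction at 25%, never go below the
# workload minimum, and defer all changes during business hours.
from datetime import datetime

MAX_REDUCTION = 0.25           # the 25%-per-change guardrail from the text
BUSINESS_HOURS = range(9, 18)  # assumption: 09:00-17:59 local time

def guarded_target(current: float, proposed: float, workload_min: float,
                   now: datetime | None = None) -> float | None:
    """Return the value automation may apply, or None to defer the change."""
    now = now or datetime.now()
    if now.hour in BUSINESS_HOURS:
        return None  # never change resources during business hours
    if proposed >= current:
        return proposed  # this guardrail only constrains reductions
    floor = max(workload_min, current * (1 - MAX_REDUCTION))
    return max(proposed, floor)

# Example: 4 cores requested, recommendation says 1, minimum is 2. At 02:00
# the gate allows 3.0 (one 25% step), not the full cut to 1.0.
print(guarded_target(4.0, 1.0, 2.0, now=datetime(2025, 1, 10, 2, 0)))
```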
Implement predictive autoscaling. Reactive scaling (wait for metrics, then react) is table stakes. Predictive scaling uses historical patterns to scale before the spike hits. This requires trust in your telemetry, which is why you built visibility in Phase 1.
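As a toy illustration of the idea (real predictive autoscalers use far richer models), this sketch pre-scales to what the same hour of day historically needed, plus an arbitrary headroom factor:

```python
# Predictive-scaling toy: size for the hour's historical need before the
# spike arrives, instead of reacting to it.
from statistics import mean

def predicted_replicas(history: dict[int, list[int]], hour: int,
                       headroom: float = 1.2, minimum: int = 2) -> int:
    """history maps hour-of-day -> replica counts that handled past load."""
    past = history.get(hour)
    if not past:
        return minimum  # no signal for this hour; fall back to the floor
    return max(minimum, round(mean(past) * headroom))

# Example: 09:00 historically needed 8-10 replicas; pre-scale before 09:00.
history = {9: [8, 9, 10, 9], 3: [2, 2, 2]}
print(predicted_replicas(history, hour=9))  # -> 11 (mean 9 x 1.2 headroom)
print(predicted_replicas(history, hour=3))  # -> 2
```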
Deploy instant rollback. Remember that 23% who needed this for trust? Build it. Every automated change gets a 4-hour window where it can be reverted with a single command or automatically if error rates spike.
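The core of that window fits in a few lines. This sketch assumes you record every automated change somewhere durable and can read a current error rate; the 5% threshold and the revert callable are placeholders:

```python
# Rollback window sketch: revert an automated change if error rates spike
# within 4 hours of applying it; after that, the change is accepted.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ROLLBACK_WINDOW = timedelta(hours=4)
ERROR_RATE_LIMIT = 0.05  # assumption: revert if >5% of requests are erroring

@dataclass
class Change:
    workload: str
    previous_value: float   # the spec value to restore on rollback
    applied_at: datetime

def maybe_rollback(change: Change, current_error_rate: float, revert) -> bool:
    """revert is a callable(workload, value) that re-applies the old spec."""
    age = datetime.now(timezone.utc) - change.applied_at
    if age > ROLLBACK_WINDOW:
        return False  # window closed; the change stands
    if current_error_rate > ERROR_RATE_LIMIT:
        revert(change.workload, change.previous_value)
        return True
    return False
```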
Phase 3 Checklist
- Automated rightsizing with guardrails enabled
- Predictive autoscaling deployed for variable workloads
- 4-hour automatic rollback configured for all changes
- Cost per transaction/workload metric established
- Zero manual optimization reviews required for standard workloads
The Tools That Make This Real
You don't need to build this from scratch. The infrastructure automation ecosystem has matured significantly. Here are the tools I see working in production:
Cost Visibility & Optimization
Kubecost or OpenCost: Gives you cluster-level cost breakdowns. Shows which namespaces and workloads are driving spend. Essential for Phase 1 visibility.
Vantage, CloudHealth, or Spot by NetApp: Cloud-native cost management across AWS, Azure, and GCP. Good for organizations with multi-cloud complexity.
Autonomous Rightsizing
Karpenter (AWS) or Cluster Autoscaler: Handles the mechanics of node provisioning. Karpenter is particularly impressive; it provisions right-sized instances based on pending pod requirements.
Vertical Pod Autoscaler (VPA): The Kubernetes project's tool for adjusting CPU and memory requests based on actual usage. Can run in recommendation mode or automatic mode.
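If you want VPA's numbers without its actions, run it with updateMode "Off" and read the targets from its status. Here's a minimal sketch with the kubernetes Python client, assuming the VPA CRDs are installed; the namespace and object names are placeholders:

```python
# Read a VPA's recommended per-container targets without applying them.
from kubernetes import client, config

def vpa_targets(namespace: str, vpa_name: str) -> dict[str, dict]:
    config.load_kube_config()
    api = client.CustomObjectsApi()
    vpa = api.get_namespaced_custom_object(
        group="autoscaling.k8s.io", version="v1",
        namespace=namespace, plural="verticalpodautoscalers", name=vpa_name)
    recs = vpa.get("status", {}).get("recommendation", {})
    return {
        c["containerName"]: c["target"]  # e.g. {"cpu": "250m", "memory": "512Mi"}
        for c in recs.get("containerRecommendations", [])
    }

# Placeholder names: print(vpa_targets("payments", "payments-api-vpa"))
```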
Policy Enforcement
Kyverno or OPA/Gatekeeper: Enforces your tagging policies, resource limits, and security constraints. If you require mandatory tags before a workload can deploy, these tools block non-compliant deployments.
The Microsoft Angle
Recent news from KubeCon EU 2026: Microsoft is moving more services to Kubernetes-native architecture. Their platform engineering teams are investing heavily in autonomous optimization. This isn't bleeding-edge anymore; it's becoming table stakes for competitive infrastructure.
What You're Really Building
Here's the part that doesn't show up in the surveys: Infrastructure automation isn't just about cost savings. It's about engineering velocity.
When engineers don't have to manually review every optimization recommendation, they focus on features. When systems scale automatically, deployments aren't gated by capacity planning meetings. When waste is eliminated automatically, budget conversations shift from "why are we over-spending?" to "what can we build with the headroom we created?"
The 17% of teams already doing continuous optimization aren't just saving money. They're buying focus.
And their competitors, the ones stuck at 89% agreement but 0% implementation, are paying twice. Once in cloud waste, and again in lost engineering time.
🔥 Bottom line: The gap between "automation is critical" and "we do automation" is where competitive advantage lives. Close it in 90 days using the phases above, or watch someone else do it first.
Your Move
You have three choices:
Stay manual. Keep engineering costs high, cloud costs higher, and velocity low. Accept that 30% waste as "the cost of caution."
Implement the 90-day roadmap yourself. Start with visibility this week. Pick one cluster. Deploy Kubecost. Get baseline metrics. Build momentum.
Get help to move faster. The phases work, but implementation details vary by infrastructure, team size, and risk tolerance. Having walked multiple teams through this transition, I can tell you the common pitfalls and how to avoid them.
The cloud computing market is approaching $1 trillion. The teams that master autonomous infrastructure management will absorb the savings and reinvest in their products. The teams that don't will absorb the waste and wonder why their competitors are faster.
The 89% already agree on what needs to happen. The question is whether you'll be in the 17% that actually does it.
My bet? If you made it to the end of this article, you're already in the group that acts.
Ready to close the automation gap?
Let's audit your infrastructure and build your 90-day roadmap.
From 89% agreement to 100% implementation. Let's talk.