Why 89% of DevOps Teams Are Stuck at the Starting Line

A survey published last week by CloudBolt Software asked 321 Kubernetes practitioners at enterprise organizations (1,000+ employees) about their automation capabilities. The results reveal a troubling disconnect between what teams know they should do and what they actually can do.

The data is eye-opening: 59% of teams can deploy to production automatically. That is the baseline we have all been chasing for years. But when it comes to continuously optimizing the infrastructure those deployments run on? That number plummets to 17%, a 42-point drop.

72 percentage points: the gap between the 89% who say automation is crucial and the 17% who can optimize continuously

I see this constantly in my consulting work. Teams brag about their CI/CD pipelines, their GitOps workflows, their automated testing. But ask them about cost optimization, resource right-sizing, or proactive cluster management and the conversation gets quiet fast.

"We do not want to touch it once it is running," one platform engineer told me last month. "Last time we tried to optimize memory requests, we caused an outage. Now we just leave it alone."

The Automation Trust Gap Explained

Here is what is actually happening: Most teams have automated the easy part (deploying code) but not the hard part (optimizing infrastructure). They trust automation to push new versions, but they do not trust it to manage resource allocation, scale efficiently, or clean up waste.

The survey data tells the story clearly:

89% recognize that automation is crucial for Kubernetes management

59% can deploy to production automatically

17% can continuously optimize cluster infrastructure

71% require human review before applying resource optimizations

That 71% requiring human review is the smoking gun. Teams have been burned by manual optimization attempts in the past. Resource changes caused instability. Memory tweaks led to OOMKills. Configuration updates created cascading failures. So now they gate every optimization decision behind human approval.

The problem? That approval process does not scale. The survey found that manual optimization processes break down around 250 changes per day. In a large enterprise with hundreds of microservices, each pushing multiple times daily, you are well past that threshold before lunch.

Why Teams Do Not Trust Continuous Optimization

When survey respondents were asked what would increase their trust in automation, the answers were revealing:

48% want better visibility and transparency into what automation is doing

25% need proven guardrails and safety mechanisms

23% require instant rollback capabilities when changes go wrong

Notice what is missing? Nobody is asking for more AI. Nobody wants "intelligent automation" or "machine learning-driven optimization." They want fundamentals: visibility, safety nets, and the ability to undo mistakes quickly.

This is where the Kubernetes optimization tooling market has gone sideways. Vendors keep pitching AI-powered solutions when teams are asking for better guardrails. They promise "set it and forget it" when practitioners want "show me exactly what you are about to change."

The Real Cost of the Trust Gap

Here is what the 72-point gap actually costs you:

1. Silent Infrastructure Waste

Under-utilized processors and memory have become a significant total cost of ownership problem. Teams are paying for capacity they do not need because they are afraid to right-size. The survey notes that 54% of organizations are running 100+ Kubernetes clusters, each potentially hosting thousands of workloads competing for the same CPU and memory resources.

When you are manually reviewing every optimization, you optimize rarely—maybe quarterly at best. In the meantime, workloads grow or shrink, traffic patterns change, and your infrastructure drifts further from optimal. You are overpaying every single day you wait.

2. Accumulating Technical Debt

The "do not fix what is not broken" mindset sounds conservative, but it is actually risky. Suboptimal configurations accumulate. Yesterday's "safe" memory request becomes tomorrow's bottleneck. The cluster that felt spacious last year is now resource-constrained.

When you wait for quarterly optimization cycles, you miss the gradual degradation. By the time you notice the problem, you are in emergency mode—making rushed changes under pressure, which increases the risk of the exact outages you are trying to avoid.

3. Team Burnout

Manual optimization at scale is soul-crushing work: reviewing hundreds of potential changes, cross-referencing usage data, estimating safety margins, and documenting decisions. It is necessary but tedious, and it steals hours from strategic work.

The teams that cannot automate optimization are stuck in reactive mode—constantly fighting fires, never improving. That is a recipe for burnout and turnover in roles that are already hard to hire for.

The Continuous Optimization Framework

Closing the 72-point gap is not about buying new tools or hiring more engineers. It is about building trust incrementally. Here is the framework I use with teams who want to move from "deploy and pray" to "optimize with confidence":

Phase 1: Visibility Before Action (Weeks 1-2)

Before you automate any changes, you need to see what is actually happening. Most teams have surprisingly poor visibility into their infrastructure efficiency.

Implement resource usage monitoring across all workloads (kubectl top, Prometheus metrics)

Calculate actual utilization vs. requested resources for every deployment

Set up cost allocation tags/labels so you can trace spend to specific workloads

Create a weekly "optimization opportunity" report showing low-utilization resources

Share reports with the team—make waste visible to everyone

The goal here is just visibility. Do not optimize anything yet. Get comfortable seeing the data. Let your team understand the scope of the opportunity before you propose changes.
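As a rough illustration of the "utilization vs. requested" report described above, here is a minimal Python sketch. It assumes you have already exported usage and request figures (for example from kubectl top or Prometheus) into simple records. The WorkloadUsage type, the field names, and the 40% cutoff are illustrative assumptions, not part of any Kubernetes API. It only reports; it changes nothing.

```python
from dataclasses import dataclass


@dataclass
class WorkloadUsage:
    """Hypothetical record of one workload's requests and observed usage."""
    name: str
    cpu_request_m: int    # requested CPU, millicores
    cpu_used_m: int       # observed CPU, millicores
    mem_request_mib: int  # requested memory, MiB
    mem_used_mib: int     # observed memory, MiB


def utilization(used, requested):
    """Return used/requested, or None when no request is set."""
    if requested <= 0:
        return None
    return used / requested


def opportunity_report(workloads, threshold=0.40):
    """List (workload, resource, utilization) rows below the threshold, lowest first."""
    rows = []
    for w in workloads:
        for resource, used, requested in (
            ("cpu", w.cpu_used_m, w.cpu_request_m),
            ("memory", w.mem_used_mib, w.mem_request_mib),
        ):
            ratio = utilization(used, requested)
            if ratio is not None and ratio < threshold:
                rows.append((w.name, resource, round(ratio, 2)))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        WorkloadUsage("checkout", 1000, 150, 4096, 1638),
        WorkloadUsage("search", 2000, 1500, 2048, 1800),
    ]
    for name, resource, ratio in opportunity_report(sample):
        print(f"{name:10s} {resource:6s} {ratio:.0%} of request")
```

Sorting the lowest-utilization rows first puts the most obvious waste at the top of the weekly report, which is usually where the team conversation should start.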

Phase 2: Recommend Before Automating (Weeks 3-6)

Now generate optimization recommendations, but do not implement them automatically. Create a "dry run" mode where your tooling suggests changes that humans review and approve.

Build queries/scripts that identify clear optimization candidates (e.g., pods using less than 40% of their requested CPU or memory)

Generate specific recommendations: "Reduce memory request from 4Gi to 2Gi based on 30-day peak usage of 1.6Gi"

Present recommendations in a standard format with confidence scores and rollback procedures

Track which recommendations get approved, rejected, or modified—learn your team's risk tolerance

This phase builds trust in your recommendations. The team sees that you are not making reckless suggestions. Patterns emerge. Safe optimizations become routine.
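One way to produce recommendations in the format shown above is sketched below in Python. The sizing rule is an assumption for illustration: take the observed peak, add 25% headroom, and round up to the next 256 MiB. The function and field names are hypothetical, and the rule is not taken from the survey or from any particular tool. Nothing here applies a change; it only writes the suggestion a human would review.

```python
from dataclasses import dataclass


def format_gi(mib):
    """Render MiB as Gi with at most one decimal place (4096 -> '4', 1638 -> '1.6')."""
    return f"{mib / 1024:.1f}".rstrip("0").rstrip(".")


@dataclass
class Recommendation:
    workload: str
    current_mib: int
    proposed_mib: int
    peak_mib: int
    window_days: int

    def describe(self):
        return (
            f"Reduce memory request from {format_gi(self.current_mib)}Gi "
            f"to {format_gi(self.proposed_mib)}Gi based on "
            f"{self.window_days}-day peak usage of {format_gi(self.peak_mib)}Gi"
        )


def recommend_memory(workload, current_mib, peak_mib, window_days=30,
                     headroom_pct=25, step_mib=256):
    """Suggest a smaller memory request, or return None if there is nothing to gain.

    headroom_pct and step_mib are illustrative defaults, not recommended values.
    """
    if current_mib <= 0 or peak_mib < 0:
        raise ValueError("current_mib must be positive and peak_mib non-negative")
    # Integer ceiling division: peak plus headroom, rounded up to the next step.
    steps = -(-peak_mib * (100 + headroom_pct) // (100 * step_mib))
    proposed_mib = max(steps, 1) * step_mib
    if proposed_mib >= current_mib:
        return None  # the current request is already at or below the target
    return Recommendation(workload, current_mib, proposed_mib, peak_mib, window_days)


if __name__ == "__main__":
    rec = recommend_memory("checkout", current_mib=4096, peak_mib=1638)
    if rec is not None:
        print(rec.describe())
```

Working in whole MiB keeps the arithmetic exact, and stating the window and the peak in the message lets a reviewer check the reasoning without opening a dashboard.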

Phase 3: Automate the Safe Wins (Weeks 7-10)

Start with the optimizations that have the highest confidence and lowest risk. These are typically non-critical workloads, dev environments, and obvious misconfigurations.

Select non-production namespaces as your automation proving ground

Implement automated rollback tied to health checks—if latency/error rates spike, revert immediately

Set conservative change limits: max 10% resource reduction, max 1 change per workload per day

Notify the team before and after every automated change—maintain visibility

Track outcomes obsessively: Did the change stick? Did performance suffer? Did we save money?

Success metrics for this phase: zero production incidents caused by optimization, 20-30% of recommendations automated, measurable cost savings.
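The guardrails in this phase can be expressed as a few small checks. The Python sketch below is one possible shape, under stated assumptions: the 10% cap is applied per change, a "spike" means a metric exceeding 1.5 times its pre-change baseline, and the metric names are placeholders. How you collect the metrics and apply or revert the change depends on your own tooling and is deliberately left out.

```python
from datetime import datetime, timedelta


def clamp_reduction(current, proposed, max_reduction_pct=10):
    """Limit a single reduction to max_reduction_pct of the current request."""
    floor = current - (current * max_reduction_pct) // 100
    return max(proposed, floor)


def may_change(last_change, now, min_interval=timedelta(days=1)):
    """Allow at most one change per workload per interval (one day by default)."""
    return last_change is None or now - last_change >= min_interval


def spiked(baseline, current, factor=1.5):
    """Assumed definition of a spike: current exceeds baseline times factor."""
    return current > baseline * factor


def should_rollback(before, after, factor=1.5):
    """Compare pre-change and post-change metrics; revert if either one spikes.

    Both arguments are dicts with the hypothetical keys 'p95_latency_ms' and
    'error_rate'. A zero baseline means any non-zero value counts as a spike.
    """
    return any(
        spiked(before[key], after[key], factor)
        for key in ("p95_latency_ms", "error_rate")
    )


if __name__ == "__main__":
    now = datetime(2025, 1, 2, 9, 0)
    last = datetime(2025, 1, 1, 8, 0)
    if may_change(last, now):
        # The recommendation asks for 2048 MiB; the cap allows only a 10% step.
        print("apply", clamp_reduction(current=4096, proposed=2048), "MiB")
    before = {"p95_latency_ms": 200, "error_rate": 0.01}
    after = {"p95_latency_ms": 310, "error_rate": 0.01}
    print("rollback" if should_rollback(before, after) else "keep")
```

With a per-change cap, a large recommendation is reached over several days of small steps, and each step gets its own health check before the next one is allowed.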

Phase 4: Expand and Refine (Weeks 11+)

With confidence established, gradually expand automation to production workloads. This is not about removing human judgment—it is about focusing human attention on exceptions and edge cases.

Whitelist specific production workloads for automated optimization based on Phase 3 success

Escalation path: when automated changes are uncertain, raise them for human review

Monthly optimization retrospectives: What worked? What failed? How do we improve next month?

Measure and report: monthly savings, incidents prevented, team hours reclaimed
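The whitelist and escalation path above amount to a routing decision for each proposed change. A minimal Python sketch follows, assuming each recommendation carries the confidence score from Phase 2. The ProposedChange type, the 0.9 cutoff, and the "auto" and "review" labels are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    """Hypothetical description of one proposed optimization."""
    namespace: str
    workload: str
    confidence: float  # 0.0 to 1.0, however your tooling scores it


def route(change, allowlist, min_confidence=0.9):
    """Return 'auto' for trusted, high-confidence changes, otherwise 'review'."""
    if (change.namespace, change.workload) not in allowlist:
        return "review"  # workload has not earned automation yet
    if change.confidence < min_confidence:
        return "review"  # uncertain change: escalate to a human
    return "auto"


if __name__ == "__main__":
    # Workloads approved for automation after Phase 3 (made-up names).
    allowlist = {("prod", "checkout")}
    changes = [
        ProposedChange("prod", "checkout", 0.95),
        ProposedChange("prod", "checkout", 0.60),
        ProposedChange("prod", "payments", 0.99),
    ]
    for change in changes:
        print(change.workload, change.confidence, route(change, allowlist))
```

Defaulting to "review" for anything not explicitly listed keeps human attention on the exceptions while the routine cases flow through.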

What Matters Most

Here is what I have learned from helping teams close their trust gaps:

Start small. The 17% of teams doing continuous optimization did not get there overnight. They started with visibility, then recommendations, then gradual automation. Do not try to leap straight to "set it and forget it."

Trust is built through transparency, not accuracy. Teams do not need perfect optimization algorithms. They need to understand what the system is doing and why. Every automated change should be explainable to a junior engineer.

Rollback speed matters more than change precision. Your automation will make mistakes. That is inevitable. What matters is how quickly you can detect and revert them. Invest in monitoring and rollback before you invest in smarter algorithms.

Cost savings are a side effect. The real value of continuous optimization is not the money saved (though that is nice). It is the habits formed, the technical debt prevented, and the team liberated from reactive firefighting.

The Bottom Line

The 89%/17% gap is not a technology problem. It is a trust problem. Teams have been burned by optimization changes that caused outages, so they keep every optimization behind manual approval. That is understandable but expensive.

The path forward is incremental visibility, conservative automation, and relentless transparency. You do not need AI. You do not need new vendor tools. You need the discipline to build confidence step by step.

The organizations that master continuous optimization will outcompete those that do not—not because they have smarter algorithms, but because they have lower infrastructure costs, less technical debt, and teams that spend their time on innovation instead of maintenance.

If you are in the 72-point gap, you are not behind. You are normal. But you have a choice: remain normal, or close the gap systematically. The first step is simply deciding to start.

Want help with this?
I help DevOps teams build trust in continuous optimization step by step. No AI magic, just proven systems.

clide@butler.solutions

Based in Detroit. Serving infrastructure globally.