I sat down with a platform engineering lead last week. His team manages 140 Kubernetes clusters across three cloud providers. They're burning $2.3 million annually on compute. When I asked why they don't automate rightsizing, he laughed—not a happy laugh.

"We tried that two years ago," he said. "An automated script downscaled a production database cluster at 2 AM. Took us four hours to recover. The CTO still asks about it in every planning meeting."

This isn't uncommon. In fact, it's the norm. According to CloudBolt's latest research, 89% of IT professionals recognize that automation is crucial for Kubernetes optimization. Yet only 17% are able to continuously optimize their infrastructure automatically. The gap between knowing and doing is massive—and expensive.

99% of Kubernetes clusters are overprovisioned (Cast AI 2025 Benchmark)

The Trust Gap: Why Smart Teams Can't Automate

Here's what's wild about those numbers: 59% of teams can already deploy to production automatically. The CI/CD pipeline works. The automation exists. But when it comes to optimizing resources—actually tuning the infrastructure that runs those deployments—71% require human review before making any changes.

The reason isn't technical. It's trust.

When I dig into why teams won't automate optimization, I hear the same three fears: the automation will break production the way that 2 AM script did, nothing will constrain what it's allowed to touch, and there will be no fast way to undo a bad change.

These aren't irrational concerns. I've seen what happens when automation goes wrong. But I've also seen what happens when teams let fear prevent optimization entirely.

10%: Average CPU utilization across Kubernetes clusters

The Breaking Point: Manual Processes Don't Scale

Here's where it gets really interesting. More than two-thirds of surveyed organizations (69%) report that manual optimization breaks down somewhere before 250 changes per day. That's not a theoretical limit. That's the point where human teams simply can't keep up anymore.

Think about what 250 changes per day actually means for a human team: reviewing, approving, and applying a change roughly every two minutes across an eight-hour day, every day.

If you're running Kubernetes at any meaningful scale, you're probably already past that threshold. And if you're still doing optimization manually, you're either:

  1. Missing optimization opportunities (most likely)
  2. Working your platform team to exhaustion
  3. Both

The data backs this up. Cast AI's 2025 Kubernetes Cost Benchmark found average CPU utilization of just 10% and average memory utilization of 23%. In other words, 90% of the CPU you pay for goes unused, along with 77% of the memory.

On a $500,000 annual Kubernetes bill, that is on the order of $450,000 spent on potential, not production.


The Automation Trust Framework

So how do you close the gap between "we know we need this" and "we can actually do this safely"? Here's the framework I use with teams that are ready to move past manual optimization:

Step 1: Start with Observation, Not Action

Before you automate changes, automate visibility. Most teams skip this step and wonder why nobody trusts the automation.

Goal: Build shared understanding of where the waste lives before anyone proposes changing it.
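
If you want something concrete to run in read-only mode, here's a minimal sketch using the official Kubernetes Python client and the metrics API. It assumes metrics-server is installed and your kubeconfig has read access; the script reports requested versus actual CPU per container and changes nothing.

```python
# observe.py -- read-only visibility: requested vs. actual CPU per container.
# Sketch assumptions: the official `kubernetes` Python client is installed,
# metrics-server is running in the cluster, and kubeconfig grants read access.
from kubernetes import client, config

def parse_cpu(value: str) -> float:
    """Convert Kubernetes CPU quantities ('250m', '2', '12345678n') to cores."""
    scales = {"n": 1e-9, "u": 1e-6, "m": 1e-3}
    if value and value[-1] in scales:
        return float(value[:-1]) * scales[value[-1]]
    return float(value)

def main() -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    custom = client.CustomObjectsApi()

    # CPU requests, keyed by (namespace, pod, container).
    requested_cpu = {}
    for pod in core.list_pod_for_all_namespaces().items:
        for c in pod.spec.containers:
            cpu = (c.resources.requests or {}).get("cpu")
            if cpu:
                requested_cpu[(pod.metadata.namespace, pod.metadata.name, c.name)] = parse_cpu(cpu)

    # Live usage from the metrics API (metrics.k8s.io, served by metrics-server).
    usage = custom.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
    for item in usage["items"]:
        ns, pod = item["metadata"]["namespace"], item["metadata"]["name"]
        for c in item["containers"]:
            requested = requested_cpu.get((ns, pod, c["name"]))
            if requested:
                used = parse_cpu(c["usage"]["cpu"])
                print(f"{ns}/{pod}/{c['name']}: {used:.2f} of {requested:.2f} cores "
                      f"requested ({100 * used / requested:.0f}% utilized)")

if __name__ == "__main__":
    main()
```

Memory follows the same pattern with the Ki/Mi/Gi suffixes. The point is that this stage of automation can only look, never touch.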

Step 2: Automate Recommendations First

Don't automate the fix yet. Automate the suggestion. Let the system tell you what it would do—and review those recommendations as a team.

Example workflow: A script analyzes resource utilization weekly and posts a Slack message: "Pod 'api-gateway' requests 4Gi memory but uses 1.2Gi. Recommended: reduce request to 2Gi with 3Gi limit. Estimated monthly savings: $180."

This does two things: It trains your team to understand what good optimization looks like, and it proves the automation understands your environment before you give it control.
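
Below is a rough sketch of that weekly job. The Slack incoming-webhook URL, the headroom factor, and the per-GiB price are placeholders I made up for illustration, so the dollar figures it prints are only as good as those assumptions; in practice, the usage numbers come from whatever visibility you stood up in Step 1.

```python
# recommend.py -- suggest (never apply) memory rightsizing and post it to Slack.
import os
import requests

HEADROOM = 1.5             # assumed: recommended request = observed usage * 1.5
PRICE_PER_GIB_MONTH = 5.0  # illustrative blended price, not a real cloud quote

def recommend(workload: str, requested_gib: float, used_gib: float) -> str | None:
    """Return a recommendation string, or None if the workload looks reasonable."""
    target = round(used_gib * HEADROOM, 1)
    if target >= requested_gib:
        return None
    savings = (requested_gib - target) * PRICE_PER_GIB_MONTH
    return (f"Pod '{workload}' requests {requested_gib}Gi memory but uses {used_gib}Gi. "
            f"Recommended: reduce request to {target}Gi. "
            f"Estimated monthly savings: ${savings:.0f}.")

def main() -> None:
    # In practice, pull these (workload, requested, used) rows from your metrics pipeline.
    observed = [("api-gateway", 4.0, 1.2), ("billing-worker", 8.0, 6.5)]
    messages = [m for row in observed if (m := recommend(*row))]
    if messages:
        requests.post(os.environ["SLACK_WEBHOOK_URL"],
                      json={"text": "\n".join(messages)}, timeout=10)

if __name__ == "__main__":
    main()
```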

Step 3: Implement Guardrails Before Automation

Remember that 25% of teams want proven guardrails? Here's what that looks like in practice:
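
One way to express that is a small policy module the optimizer consults before acting. Every value below is an assumption to adapt, not a prescription; the point is that the automation has to answer these questions before it touches anything.

```python
# guardrails.py -- checks the optimizer must pass before applying any change.
# All values are illustrative starting points, not recommendations.
from datetime import datetime, timezone

GUARDRAILS = {
    "excluded_namespaces": {"prod-db", "payments"},  # workloads automation may never touch
    "max_shrink_pct": 30,           # never cut a request by more than 30% in one step
    "max_changes_per_day": 20,      # cap the blast radius while trust is being built
    "change_window_utc": (13, 21),  # business hours only, while humans are watching
}

def change_allowed(namespace: str, current: float, proposed: float, changes_today: int) -> bool:
    """Return True only if a proposed rightsizing change passes every guardrail."""
    if namespace in GUARDRAILS["excluded_namespaces"]:
        return False
    if changes_today >= GUARDRAILS["max_changes_per_day"]:
        return False
    if current > 0 and 100 * (current - proposed) / current > GUARDRAILS["max_shrink_pct"]:
        return False
    start, end = GUARDRAILS["change_window_utc"]
    return start <= datetime.now(timezone.utc).hour < end
```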

These aren't permanent restrictions. They're training wheels. Once the automation proves itself, you expand the scope.

Step 4: Build Instant Rollback

The 23% who want instant rollback capabilities are onto something. You need a way to undo changes fast.

Here's the minimal viable rollback system:

  1. Store previous resource configurations in a ConfigMap or Git before applying changes
  2. Apply changes with a 24-hour "evaluation period" before making them permanent
  3. Create a single command (or Slack bot) that reverts the last optimization batch
  4. Set up alerts that trigger if error rates or latency increase after an optimization

The goal isn't zero incidents. It's fast recovery when incidents happen. A 5-minute rollback turns a potential disaster into a minor hiccup.
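
Items 1 and 3 on that list are where most teams stall, so here's a minimal sketch of both: save a Deployment's resource settings to a ConfigMap before changing them, and restore that snapshot on demand. It assumes the official Kubernetes Python client; the snapshot namespace and naming scheme are my own convention, not a standard.

```python
# rollback.py -- snapshot a Deployment's resources before optimizing, restore on demand.
import json
from kubernetes import client, config

SNAPSHOT_NAMESPACE = "optimizer-snapshots"  # assumed to exist already

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

def snapshot(namespace: str, deployment: str) -> None:
    """Save current container resources to a ConfigMap before any change is applied."""
    dep = apps.read_namespaced_deployment(deployment, namespace)
    resources = {
        c.name: client.ApiClient().sanitize_for_serialization(c.resources)
        for c in dep.spec.template.spec.containers
    }
    cm = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=f"snap-{namespace}-{deployment}"),
        data={"resources": json.dumps(resources)},
    )
    core.create_namespaced_config_map(SNAPSHOT_NAMESPACE, cm)

def rollback(namespace: str, deployment: str) -> None:
    """Re-apply the stored snapshot; wire this to a Slack command for one-step undo."""
    cm = core.read_namespaced_config_map(f"snap-{namespace}-{deployment}", SNAPSHOT_NAMESPACE)
    saved = json.loads(cm.data["resources"])
    dep = apps.read_namespaced_deployment(deployment, namespace)
    for c in dep.spec.template.spec.containers:
        c.resources = saved[c.name]
    apps.patch_namespaced_deployment(deployment, namespace, dep)
```

A Git commit works just as well as a ConfigMap. What matters is that the undo path exists, and has been tested, before the first automated change ships.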

Step 5: Graduate to Continuous Optimization

Once you've run through steps 1-4 for 30-60 days without incident, you're ready for true continuous optimization. This is where the real savings live.

Continuous optimization means the system observes utilization, applies rightsizing changes within your guardrails, and rolls back on its own when a change goes wrong, without a human reviewing every decision.

This is what that 17% of teams have figured out. And it's why they're spending significantly less on infrastructure while running more workloads.


The Platform Engineering Advantage

Here's the pattern I've noticed: organizations with mature platform engineering practices close this trust gap faster. Why? Because platform teams treat infrastructure automation as a product, not a script.

A good platform team doesn't just write automation. They ship it like a product: documented, versioned, tested against real failure modes, and owned by someone accountable when it misbehaves.

The result? Developers trust the platform to make optimization decisions because the platform team prioritized trust-building over speed.

Consider this: 54% of surveyed organizations are managing more than 100 Kubernetes clusters. At that scale, manual optimization isn't just inefficient—it's impossible. The teams winning right now are the ones who invested in automation trust early, before the complexity overwhelmed them.


The Cost of Waiting

Every month you delay automation, you're paying for resources you don't use. Let's be concrete about what that looks like:

If you're running Kubernetes at enterprise scale—say $100,000/month in compute costs—and your utilization matches the industry average of 10% CPU and 23% memory, you're spending roughly $70,000 monthly on idle capacity.

Even conservative automation that brings you to 40% CPU utilization (still leaving plenty of headroom) would save you $30,000 per month. That's $360,000 annually—just from rightsizing automation.
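
If you want to sanity-check that math, here's the back-of-envelope version. The assumption that CPU drives about 40% of the bill is mine, chosen for illustration; the utilization figures are the benchmark numbers cited above.

```python
# Back-of-envelope check on the $30K/month figure.
monthly_bill = 100_000
cpu_share = 0.40                    # assumed fraction of spend driven by CPU
current_util, target_util = 0.10, 0.40

cpu_spend = monthly_bill * cpu_share
needed_fraction = current_util / target_util   # 0.25: keep a quarter of today's CPU
monthly_savings = cpu_spend * (1 - needed_fraction)

print(f"Monthly savings: ${monthly_savings:,.0f}")        # $30,000
print(f"Annual savings:  ${monthly_savings * 12:,.0f}")   # $360,000
```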

The question isn't whether you can afford to automate. It's whether you can afford not to.

$360K: Potential annual savings from basic rightsizing automation on $1.2M infrastructure spend

Start This Week

You don't need a perfect platform engineering practice to start. You need one step forward.

This week:

  1. Deploy Kubecost or OpenCost. Get visibility into your actual utilization.
  2. Identify your top 5 most overprovisioned workloads.
  3. Write a script that recommends (not applies) rightsizing changes.
  4. Review those recommendations with your team.
  5. Pick one non-critical workload and manually apply the recommendation. Measure the results.

That's it. Five steps. One afternoon. The start of closing your automation trust gap.

The teams that figure this out now will have a massive advantage in the next 18 months. As Kubernetes complexity grows (and it will), the gap between automated and manual optimization will widen. The cost of sitting in the 83% that still optimizes by hand will only increase.

The 17% who've solved this aren't smarter. They just started sooner.

Want help with this?
I'll audit your Kubernetes automation readiness and build a roadmap to continuous optimization.

clide@butler.solutions

Based in Detroit. Serving infrastructure teams globally.