Cloud infrastructure used to be a trade-off between speed and control.
- ClickOps was fast but messy.
- IaC (Infrastructure as Code) brought repeatability and guardrails.
- Platform engineering scaled reliability—but at a cost in time and complexity.
Now we’re entering a new phase: AI-driven infrastructure. Agents that plan, execute, validate, and remediate. LLMs that turn “what do you need?” into working Terraform-like changes (or Kubernetes manifests, pipelines, policies). And AIOps that reduces noise, correlates incidents, and suggests fixes while your team sleeps.
It’s exciting. It’s also expensive. And it introduces a new strategic risk: dependency.
Let’s break down what’s changing, and what you should do about it.
The shift: from “writing infra” to “orchestrating infra”
For years, most cloud work followed a familiar rhythm. An engineer proposes a change—networking, compute, IAM, pipelines—someone reviews it, it gets applied, and then reality does what it always does: an edge case appears, a dependency behaves differently than expected, and somebody is debugging under pressure.
AI agents change that choreography. Instead of humans crafting every step by hand, the human increasingly becomes the conductor: defining intent, supplying constraints, and deciding whether the agent’s output is safe enough to execute. In theory, it’s a productivity leap. In practice, it’s also a risk shift.
Because when the “operator” is an agent, the system can move from a slow mistake to a fast mistake. And speed is not neutral in production—it amplifies blast radius.
Key idea: AI doesn’t remove work, it relocates it.
The hard part becomes governance: shaping intent, validating diffs, and enforcing control boundaries.
The best teams treat agents like junior engineers with unlimited stamina: useful, quick, and sometimes wrong in subtle ways. That framing keeps you cautious in the right places—especially around security and production changes.
AIOps is becoming the reliability control plane
AIOps has existed for a while, but LLMs plus modern telemetry have changed the ceiling. Instead of merely detecting anomalies, we’re seeing systems that can interpret context: correlating incidents across logs, metrics, traces, deploy timelines, and dependency graphs. That unlocks faster triage, smarter routing, and remediation suggestions that resemble real on-call playbooks.
But it only works when the fundamentals are sound. If your observability data is noisy, inconsistent, or poorly tagged, the model doesn’t magically become “more correct.” It becomes more confident while being wrong—which is arguably worse than ignorance because it pushes teams to act quickly on flawed conclusions.
AIOps can reduce MTTR and alert fatigue, but it must operate inside strict guardrails. Reliability automation without boundaries becomes “autopilot,” and autopilot is only a good idea when you fully trust the sensors, the control logic, and the failure modes. Most infrastructures are not there yet.
Reality check: LLMs are excellent at pattern recognition and narrative stitching.
They are not inherently reliable at causal inference in messy distributed systems.
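The pattern-recognition half of that claim is the part you can make deterministic before any model gets involved. A minimal sketch, with entirely hypothetical event schemas and field names: correlate an alert with recent deploys for the same service inside a lookback window, so the LLM narrates a ranked shortlist instead of guessing from raw telemetry.

```python
from datetime import datetime, timedelta

# Hypothetical sketch: rank deploys as suspects for an alert using service
# match + time proximity. This is the deterministic correlation step an
# AIOps layer can run before any LLM narrative is generated.

def correlate_alert(alert, deploys, window_minutes=30):
    """Return deploys for the same service within the lookback window,
    most recent first. All field names here are illustrative."""
    window = timedelta(minutes=window_minutes)
    candidates = [
        d for d in deploys
        if d["service"] == alert["service"]
        and timedelta(0) <= alert["time"] - d["time"] <= window
    ]
    return sorted(candidates, key=lambda d: d["time"], reverse=True)

deploys = [
    {"service": "checkout", "time": datetime(2024, 5, 1, 10, 0), "sha": "a1b2"},
    {"service": "search",   "time": datetime(2024, 5, 1, 10, 5), "sha": "c3d4"},
    {"service": "checkout", "time": datetime(2024, 5, 1, 10, 20), "sha": "e5f6"},
]
alert = {"service": "checkout", "time": datetime(2024, 5, 1, 10, 25)}
suspects = correlate_alert(alert, deploys)
print([d["sha"] for d in suspects])  # → ['e5f6', 'a1b2']
```

Note what makes this trustworthy: the shortlist is reproducible from tagged telemetry. If services aren't tagged consistently, this step silently returns the wrong suspects, and the LLM's confident story on top of it is exactly the failure mode described above.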
The hidden bill: agents are token-hungry by design
One of the most common surprises is cost. Teams think of LLMs as a Q&A layer, but agents don’t behave like chat. Agents iterate. They fetch context, draft a change, check results, revise, and repeat—often across multiple tools and multiple models.
In a real infrastructure workflow, an agent may pull repository context, inspect IaC modules, query cloud state, generate a plan, interpret the diff, compare it to policy, adjust parameters, and run the loop again. Add logs, dashboards, security findings, and compliance checks, and you’re feeding it a huge amount of context every time.
That’s where token usage turns into a serious line item. The expensive moments aren’t the short prompts; they’re the long-running loops where each cycle reintroduces large diffs, verbose policy output, or incident threads that never end. You won’t notice it on day one. You notice it when the workflow spreads: CI uses LLMs by default, ops consults the agent on every alert, and “AI-first” becomes the standard operating procedure.
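A back-of-envelope model makes the shape of the bill obvious. Every number below is an illustrative assumption, not a vendor price: the point is structural. Because each cycle re-sends the accumulated context, input tokens grow roughly linearly per iteration, so an agent loop costs far more than the single call it superficially resembles.

```python
# Illustrative cost model for an iterative agent loop. Prices and token
# counts are assumptions for the sketch, not real vendor figures.

def loop_cost(iterations, base_context_tokens, added_per_cycle_tokens,
              output_tokens_per_cycle, usd_per_1k_input, usd_per_1k_output):
    total_in = total_out = 0
    context = base_context_tokens
    for _ in range(iterations):
        total_in += context                  # whole context re-sent each cycle
        total_out += output_tokens_per_cycle
        context += added_per_cycle_tokens    # new diffs/logs/policy output accumulate
    return (total_in / 1000) * usd_per_1k_input + (total_out / 1000) * usd_per_1k_output

# One Q&A-style call vs. an 8-cycle plan/check/revise loop:
single = loop_cost(1, 4_000, 0, 800, 0.005, 0.015)
agent = loop_cost(8, 20_000, 6_000, 1_500, 0.005, 0.015)
print(f"single call: ${single:.2f}, agent loop: ${agent:.2f}")
# → single call: $0.03, agent loop: $1.82
```

Roughly a 60x difference per workflow in this toy model, before tool calls and retries. Multiply by "CI uses LLMs by default" and "ops consults the agent on every alert," and the line item writes itself.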
Cost isn’t just tokens. It’s also tool calls, latency, retries, and the human time spent supervising AI output.
AI as a commodity… and the dependency trap after it
It’s reasonable to expect many AI capabilities to commoditize. We’ll see more providers, better open alternatives, improved price/performance, and cheaper inference. That sounds like a buyer’s market.
The risk appears later—when AI becomes embedded in your operational muscle memory. Once agents are responsible for the workflows that keep production stable, they move into the “must not fail” path. At that point, switching becomes painful: not because the model is impossible to replace, but because your organization has reorganized itself around the assumption that AI is always available, always fast, and always there to fill gaps in process.
That dependency changes the economics. If your incident response, change management, or infrastructure provisioning relies on a specific vendor’s agent stack, pricing changes stop being a procurement annoyance and become an operational threat. The pattern is familiar: primitives become cheap, but managed layers create stickiness. Adoption is easy; unwinding is expensive. Dependency becomes leverage.
The trap isn’t “AI is expensive.”
It’s “AI becomes indispensable,” and indispensable tools eventually gain pricing power.
Security: don’t turn your AI into an all-access admin
It usually starts with good intentions. You give an agent read-only access so it can inspect logs, understand an incident, or draft a change. It performs well, the team gets faster, and confidence builds.
Then comes the slippery step: “Let’s just give it broader permissions so it can fix things end-to-end.” Suddenly the agent has access to every repository, the ability to push code, and wide cloud permissions across environments.
That’s where AI stops being a helpful assistant and becomes a privileged operator—often without the same controls you’d require for a human.
Important: The risk isn’t only “hallucinations.”
The real issue is speed + privilege + complexity.
A wrong change can propagate across repos and environments in minutes.
When an AI can act quickly and widely, small mistakes don’t stay small. A single misread of context can open network paths, broaden IAM policies, rotate secrets incorrectly, or deploy configuration that looks “fine” but quietly weakens security. And because the changes happen fast, you might not notice until the blast radius is already large.
Why “it worked fine” is not a security signal
Teams often expand permissions because early tests look good. But success in a few runs is not proof of safety—especially in production systems where edge cases and hidden dependencies are the norm.
The moment you trust the agent enough to remove guardrails is the moment you need more guardrails. Not fewer.
A safer model: AI proposes, pipelines enforce
The healthiest pattern is to treat AI like a contributor, not an operator.
Let the agent draft changes, suggest fixes, and prepare pull requests. But keep execution behind the same controls you’d use for any production-grade change: review, automated checks, and policy enforcement. This prevents the worst-case scenario—an agent applying destructive or insecure changes directly in production—while keeping most of the speed.
Here’s the key principle:
AI can accelerate decisions. It should not bypass your change control.
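The principle reduces to a small, deterministic gate. A minimal sketch, with assumed check names (nothing here is a real pipeline API): the agent can only produce a proposal object, and a plain function — not the model — decides whether it may be applied.

```python
# "AI proposes, pipelines enforce" as a deterministic gate. The check
# names and proposal shape are illustrative assumptions.

REQUIRED_CHECKS = ("plan_generated", "policy_passed", "human_approved")

def may_apply(proposal):
    """Return (allowed, missing_checks) -- apply only if every control passed."""
    missing = [c for c in REQUIRED_CHECKS if not proposal.get(c)]
    return (len(missing) == 0, missing)

draft = {"plan_generated": True, "policy_passed": True, "human_approved": False}
allowed, missing = may_apply(draft)
print(allowed, missing)  # → False ['human_approved'] -- blocked until a human approves
```

The design choice that matters: the gate is boring, auditable code with no model in the loop. The agent can argue its change is safe all it wants; it cannot flip `human_approved` itself.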
IaC is still the source of truth (especially when AI is involved)
Even if an agent can “just do it,” you shouldn’t let it.
Infrastructure as Code is your safety rail: it’s auditable, reviewable, and reversible. When something goes wrong (and it will, eventually), IaC is what makes rollbacks and incident response sane. It gives you a clear diff, a history of who approved what, and a repeatable path to restore known-good state.
Without IaC, you’re left with a dangerous mix: fast changes, partial visibility, and manual cleanup under pressure.
Best practice:
If a change can’t be expressed in IaC, it doesn’t belong in automated production workflows.
This doesn’t mean AI isn’t useful. It means the output of AI should become versioned infrastructure—not invisible, one-off actions.
The “minimum safe access” mindset
Instead of giving agents broad admin permissions, design your setup so the agent can do its job with controlled scope. Make read access the default. When write access is truly necessary, constrain it to a narrow role, a limited environment, and short-lived credentials. In other words: reduce blast radius by design, not by hope.
And on the Git side, treat direct pushes to main as a hard line. AI-generated changes should flow through the same path as every other production change: pull request, checks, review, merge, deploy.
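"Minimum safe access" can be enforced at credential-issue time rather than by convention. A sketch under stated assumptions — the role names, action strings, and structure are invented for illustration and match no particular cloud provider's API: write scope is an explicit allow-list, time-boxed, and simply unavailable for production.

```python
from datetime import datetime, timedelta, timezone

# Illustrative credential issuer for an agent: narrow role, limited
# environment, short-lived scope. All names here are hypothetical.

def issue_scoped_credential(role, environment, actions, ttl_minutes=15):
    if environment == "prod":
        # Agents never get direct write scope in prod; prod changes flow
        # through the PR -> checks -> review -> deploy path instead.
        raise PermissionError("no direct agent write scope in prod")
    return {
        "role": role,
        "environment": environment,
        "actions": tuple(actions),  # explicit allow-list, no wildcards
        "expires_at": datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes),
    }

cred = issue_scoped_credential("agent-iac-writer", "staging", ["s3:PutObject"])
print(cred["environment"], cred["actions"])  # → staging ('s3:PutObject',)
```

Blast radius is reduced by construction: even a fully confused agent holding this credential can only touch one action, in one environment, for fifteen minutes.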
Rule of thumb:
If you wouldn’t give a new engineer “push to prod + admin IAM” on day one, don’t give it to an agent—ever.
How to adopt AI without surrendering control
You can absolutely get the benefits—faster delivery, better triage, more consistent patterns—without letting agents run wild. The difference is whether you treat AI like a toy or like production infrastructure.
Start by operationalizing it: define what “good” looks like (latency, availability, acceptable error modes), design explicit fallback paths (human runbooks, deterministic automation), and make agent behavior auditable. Prompts, outputs, tool calls, approvals—this is all production telemetry now. If you can’t replay what happened, you can’t trust it.
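"If you can't replay what happened, you can't trust it" implies an append-only trail. One way to sketch that — the event schema is an assumption, not a standard — is a hash-chained log where every prompt, tool call, and approval links to the record before it, so tampering or gaps are detectable at replay time.

```python
import hashlib
import json
from datetime import datetime, timezone

# Sketch of an auditable agent trail: each record embeds the hash of the
# previous one, making the sequence replayable and tamper-evident.
# The event types below are illustrative.

def append_event(log, event):
    prev = log[-1]["hash"] if log else "genesis"
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "prev": prev,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return record

log = []
append_event(log, {"type": "prompt", "text": "widen the SG rule for service X?"})
append_event(log, {"type": "tool_call", "tool": "terraform_plan"})
append_event(log, {"type": "approval", "by": "oncall-engineer"})
print(len(log), log[1]["prev"] == log[0]["hash"])  # → 3 True
```

Whether you build this or buy it, the property to insist on is the same: agent telemetry must be first-class production telemetry, not best-effort chat history.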
Then place guardrails where failure is costly. Least privilege is non-negotiable, and “read-only by default” is the right baseline. Execution should be staged and boring: generate a plan, run policy checks, require approval, then apply. Anything touching IAM, networking, data access, or production rollout deserves mandatory peer review, just like it would for a human engineer.
Finally, be intentional about portability. Don’t weld your entire workflow to one vendor’s agent framework. Separate the model from the workflow orchestration where possible. And keep your source of truth where it belongs: IaC in Git. Agents can propose changes; your platform should execute them through reviewable pipelines. That’s how you keep rollbacks simple and drift under control.
The biggest failure mode isn’t hallucination.
It’s complacency: teams stop thinking because “the agent said it’s fine.”
AI should compress toil and accelerate judgment—not replace accountability. If you design it that way, you get speed with safety. If you don’t, you get automation that can break production faster than any human ever could.
Closing thought
AI agents and AIOps are pushing cloud operations into a new era—similar to what IaC did for repeatability, but with a much bigger acceleration factor. Done well, you’ll deliver infrastructure changes faster, reduce time-to-diagnose during incidents, and enforce standards across teams with far less manual effort.
But the trade-off is real. Agents introduce a new cost curve (tokens, tool calls, orchestration overhead) and a new category of operational risk: high-privilege automation moving at machine speed. On top of that sits the strategic layer—once AI becomes part of the “must not fail” chain, switching providers gets harder and pricing power shifts away from you.
The teams that win won’t be the ones who bolt AI onto everything. They’ll be the ones who treat it like production infrastructure: constrained permissions, auditable actions, staged execution, and IaC as the source of truth—so every change is reviewable and every rollback is possible.
Want help doing this safely?
If you’re rolling out AI-assisted infrastructure or AIOps and want the guardrails, cost controls, and operating model that keeps production predictable, Good2Cloud can help you implement it—without turning your agent into an unchecked admin.
