TL;DR

Platform engineering was built around the cluster. That model worked when infrastructure meant a handful of clusters. It doesn't anymore.
Five key drivers — AI/GPU infrastructure, multi-cloud proliferation, dedicated SaaS environments, edge deployment, and regulated industries — are pushing platform teams to operate at fleet scale.
The toolstack has evolved in waves: deployment has matured, promotion is maturing, and fleet-scale promotion is the new frontier.
Fleet promotion requires a different set of building blocks than single-cluster promotion: failure thresholds, ordered rollout waves, rollback as a first-class operation, aggregate observability, per-team RBAC, and the ability to treat the fleet as a unit.
Platform teams that build fleet promotion as a first-class capability now will have infrastructure that scales with the business. Those that defer it will feel the gap on every release.

Platform engineering was built around the cluster: deploying into it, monitoring it, recovering it. GitOps, and in particular Argo CD (created by the team behind Akuity), established the foundation: keeping clusters in sync with desired state, reliably and at scale. That model worked when infrastructure meant a handful of clusters. That's no longer the world platform engineers are operating in.

The infrastructure reality facing platform teams today isn't one cluster or ten clusters. It's hundreds or even thousands. Several forces are driving this: GPU infrastructure at neocloud scale, multi-cloud proliferation, tightening data residency rules, and the push to the edge. Each is redrawing what production infrastructure looks like, and in each case the answer involves more clusters, more environments, and more operational surface than cluster-centric tools were designed for.

Why fleet-scale infrastructure is becoming the new normal

The shift to fleet-scale infrastructure isn't happening in isolation. Of the drivers we see behind this change, these five are hitting platform engineering teams hardest:

AI and GPU infrastructure. Neoclouds like CoreWeave operate Kubernetes-native GPU infrastructure spanning 250,000+ GPUs. Each node carries drivers, firmware, and CUDA versions on a release cadence set by NVIDIA — not the platform team. The update surface is massive, the schedule is externally dictated, and drift across nodes has direct performance consequences. The platform engineering challenge here isn't just scale; it's that scale arrived almost overnight, at a density never seen before.

Multi-region and multi-cloud proliferation. Sovereignty rules, latency requirements, and resilience architecture mean the same workload now runs across many regions and providers simultaneously. Mercedes-Benz scaled from 200 to roughly 1,000 clusters. Cluster counts didn't grow linearly with the business; they multiplied.

Dedicated SaaS environments. The deployment target isn't a shared platform; it's a customer's isolated environment. Veeva Systems, which serves more than 1,500 life sciences customers, runs software where every customer environment carries its own validation requirements, compliance obligations, and release cycle. One product, never one version everywhere. Every tenant is its own promotion target with its own gates.

Edge and retail locations. Latency requirements and data-residency rules are pushing workloads closer to users, and platform engineers are being asked to make that happen at scale. Chick-fil-A runs edge Kubernetes clusters in each of its approximately 2,800 restaurants, enabling business-critical workloads to run without internet dependency. Rollouts have to account for connectivity constraints, local hardware variance, and the operational reality that you can't SSH into a store.

Regulated industries. Capital One has been building on a hardened, compliance-first Kubernetes platform since the early days. The platform is purpose-built to carry security controls, audit requirements, and governance policies across every environment. In financial services, healthcare, and government, each environment carries its own compliance classification and change control gate. Promotion isn't just an engineering decision — it's a legal one.

From cluster-level deployment to fleet: how the platform engineering toolstack is evolving

The first wave of platform tooling focused on deployment. Argo CD and Flux, the GitOps tools that became the industry standard, addressed it directly: reconciling desired state at the cluster level, continuously and reliably. Today, cluster-level deployment is the most mature part of the modern software delivery pipeline.

The second wave addressed promotion: how does a release move through environments — dev, staging, production — with gates, verification, and consistency? This is the problem Kargo was built to address. Where Argo CD handles deployment, Kargo handles what happens before that: deciding what gets promoted, to where, and under what conditions. Declarative pipelines, native release gates, and freight as a first-class release unit give platform teams a shared promotion model that scales across services and teams.

The third wave has arrived: fleet-scale promotion. Operating a release across a fleet requires a different model entirely, and platform teams are feeling that gap. The questions change. It's no longer just "did this promote successfully?" but "how many clusters are on the new version, where did it fail, and what happens next?"

What platform engineers need to get fleet promotion right

Fleet promotion isn't single-cluster promotion at larger scale; it's a different set of building blocks entirely. Here's what platform teams need to get it right:

Failure thresholds, not binary outcomes. At fleet scale, some failures are expected. The question isn't whether any cluster failed; it's whether the failure rate crossed a threshold that warrants stopping the rollout. A system that treats one failure in 500 the same as 100 failures in 500 isn't fit for purpose.
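
As a minimal sketch of the threshold idea, the check below compares the observed failure rate against a declared maximum. The helper names `failure_rate` and `should_halt` and the 5% default are assumptions made for illustration; they do not come from any particular tool.

```python
def failure_rate(failed: int, total: int) -> float:
    """Fraction of fleet clusters whose promotion failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return failed / total


def should_halt(failed: int, total: int, max_failure_rate: float = 0.05) -> bool:
    """Return True once the failure rate exceeds the allowed threshold.

    max_failure_rate is an illustrative default, not a recommendation.
    """
    return failure_rate(failed, total) > max_failure_rate


if __name__ == "__main__":
    print(should_halt(1, 500))    # 0.2% failure rate -> False
    print(should_halt(100, 500))  # 20% failure rate  -> True
```

With this shape, one failure in 500 and 100 failures in 500 produce different decisions, which is the distinction a binary pass/fail check cannot make.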

Ordered rollout across deliberate waves. A change should move through the fleet in sequence — pilot environments first, then canary, then the general population, with the highest-stakes targets last. Sequencing by risk is what keeps a bad change from reaching everything before anyone notices. Dropping a change across the full fleet simultaneously isn't a deployment strategy; it's a blast radius.
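
One way to picture sequencing by risk is to group clusters into waves by a tier label. This is a rough sketch: the tier names follow the paragraph above, while the `Cluster` record and `plan_waves` function are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical risk tiers, lowest-risk first; the names mirror the text above.
WAVE_ORDER = ["pilot", "canary", "general", "high-stakes"]


@dataclass(frozen=True)
class Cluster:
    name: str
    tier: str  # expected to be one of WAVE_ORDER


def plan_waves(clusters: list[Cluster]) -> list[list[str]]:
    """Group clusters into rollout waves ordered by risk tier.

    Empty tiers are skipped so every returned wave has at least one cluster.
    """
    waves: dict[str, list[str]] = {tier: [] for tier in WAVE_ORDER}
    for cluster in clusters:
        if cluster.tier not in waves:
            raise ValueError(f"unknown tier {cluster.tier!r} for {cluster.name}")
        waves[cluster.tier].append(cluster.name)
    return [sorted(waves[tier]) for tier in WAVE_ORDER if waves[tier]]
```

Each inner list is one wave. A rollout driver would finish and verify a wave before starting the next, so the highest-stakes clusters only see changes that have already survived everywhere else.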

Rollback as a first-class operation. Rolling back a release across a fleet of hundreds of clusters needs to be as deliberate and observable as rolling it forward. Ad hoc rollback at scale is where incidents happen.

Aggregate observability. Platform teams need a single view of what's happening across the fleet during a rollout — not N individual cluster dashboards. What percentage of clusters are on the new version? Where did promotions fail, and why? Is the rollout proceeding within expected parameters?
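
The questions above can be answered from a single rollup of per-cluster status. The sketch below assumes each cluster reports a small record with `version`, `state`, and `error` fields; that record shape and the `summarize_rollout` name are illustrative assumptions, not a real API.

```python
from collections import Counter


def summarize_rollout(statuses: dict[str, dict], target_version: str) -> dict:
    """Collapse per-cluster status records into one fleet-level summary.

    statuses maps cluster name -> {"version": str, "state": str, "error": str | None}.
    This record shape is an assumption made for the sketch.
    """
    total = len(statuses)
    on_target = sum(1 for s in statuses.values() if s["version"] == target_version)
    failures = {
        name: s.get("error") or "unknown"
        for name, s in statuses.items()
        if s["state"] == "failed"
    }
    return {
        "total": total,
        "percent_on_target": round(100 * on_target / total, 1) if total else 0.0,
        "states": dict(Counter(s["state"] for s in statuses.values())),
        "failures": failures,
    }
```

The summary gives the percentage on the new version, a count per state, and the failure reason per cluster in one structure, rather than one dashboard per cluster.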

Per-team RBAC. At the organizations driving fleet-scale adoption, multiple teams own subsets of the fleet. The promotion system needs to reflect that: scoped control, not global access.
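
Scoped control can be sketched as an explicit grant per team, with everything else denied. The `TeamGrant` record, the action names, and `is_allowed` are hypothetical and only illustrate the shape of the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamGrant:
    """Hypothetical grant: which fleet subsets a team may act on, and how."""
    team: str
    fleets: frozenset[str]
    actions: frozenset[str]  # e.g. {"promote", "rollback", "view"}


def is_allowed(grants: list[TeamGrant], team: str, fleet: str, action: str) -> bool:
    """Allow an action only if some grant for the team covers that fleet and action.

    Anything not explicitly granted is denied; there is no global wildcard.
    """
    return any(
        g.team == team and fleet in g.fleets and action in g.actions
        for g in grants
    )
```

A team that owns one subset of the fleet can promote and roll back there, while the same request against another team's subset is rejected.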

The fleet as a unit. When something goes wrong, teams need to be able to act on the fleet as a whole: stop the rollout, roll back across all affected clusters, investigate. A system that requires per-cluster remediation at scale is operationally untenable.

Best practices for platform engineers operating at fleet scale

Across the organizations we work with, the platform teams getting ahead of this problem share a few common patterns:

Define the fleet before the rollout. A fleet isn't just "all clusters"; it's a logical grouping with shared characteristics, ownership, and promotion policy. Defining fleet boundaries explicitly, before a rollout starts, is what makes aggregate observability and scoped RBAC possible.

Treat promotion policy as code. Failure thresholds, soak times, approval gates, and rollback conditions should be declared alongside the rest of the deployment configuration — not set at runtime by whoever is on call. Policy drift across environments is how you get inconsistent behavior at the worst possible moment.
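
To make "policy as code" concrete, here is a small sketch of a policy record that lives in version control and is validated when it loads. The `PromotionPolicy` fields and the example values are assumptions for illustration, not the schema of any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionPolicy:
    """Hypothetical policy record kept in version control with the deploy config."""
    max_failure_rate: float   # fraction of clusters allowed to fail, 0.0 to 1.0
    soak_minutes: int         # wait time between waves
    required_approvers: int   # approvals needed before the next wave
    rollback_on_halt: bool    # roll back automatically when the rollout halts

    def __post_init__(self) -> None:
        if not 0.0 <= self.max_failure_rate <= 1.0:
            raise ValueError("max_failure_rate must be between 0.0 and 1.0")
        if self.soak_minutes < 0 or self.required_approvers < 0:
            raise ValueError("soak_minutes and required_approvers must be >= 0")


# Example values only; every environment reads the same reviewed definitions.
POLICIES = {
    "staging": PromotionPolicy(0.10, soak_minutes=15, required_approvers=0, rollback_on_halt=True),
    "production": PromotionPolicy(0.02, soak_minutes=60, required_approvers=2, rollback_on_halt=True),
}
```

Because the record is immutable and validated on construction, changing a threshold means changing a reviewed file, not a setting someone adjusts during an incident.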

Instrument before you scale. The observability you need at fleet scale is different from what you need at cluster scale. Teams that build aggregate rollout visibility early — before they're operating 500 clusters — aren't caught flat-footed when the fleet grows.

Design for partial failure. At fleet scale, a rollout that stops at the first failure is often worse than one that continues with visibility. Build explicit failure handling into the promotion model: what's the threshold, what happens when it's crossed, and who gets notified.
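
The three questions above (threshold, action, notification) can be sketched as a rollout loop that tolerates failures until a limit is crossed. The `promote` and `notify` callables are caller-supplied stand-ins for real integrations, and `run_rollout` itself is a hypothetical helper.

```python
from typing import Callable, Iterable


def run_rollout(
    clusters: Iterable[str],
    promote: Callable[[str], bool],
    fleet_size: int,
    max_failure_rate: float,
    notify: Callable[[str], None],
) -> dict:
    """Promote clusters one by one, tolerating failures up to a threshold.

    promote returns True on success; notify receives a message when the
    rollout halts. Both are stand-ins supplied by the caller.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    succeeded: list[str] = []
    failed: list[str] = []
    for name in clusters:
        if promote(name):
            succeeded.append(name)
            continue
        failed.append(name)
        # Halt only when failures exceed the allowed share of the fleet.
        if len(failed) / fleet_size > max_failure_rate:
            notify(f"rollout halted: {len(failed)}/{fleet_size} clusters failed")
            return {"halted": True, "succeeded": succeeded, "failed": failed}
    return {"halted": False, "succeeded": succeeded, "failed": failed}
```

A single failed cluster is recorded and the rollout continues. Once the failure share passes the limit, the loop stops, reports what succeeded and what failed, and sends one notification.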

Separate fleet promotion from application deployment. The platform team that manages the fleet promotion pipeline shouldn't need to understand the application, and the application team shouldn't need to understand the fleet topology. Clean separation of concerns here prevents both coordination overhead and blast-radius incidents.

What platform engineers should be building for now

The forces driving fleet-scale adoption are accelerating. GPU infrastructure keeps expanding. Multi-cloud architecture is now the standard. SaaS isolation and edge deployment are becoming table stakes for enterprise software. Regulatory requirements are tightening and show no signs of reversing.

Platform engineers who build fleet promotion as a first-class capability now — declarative, policy-driven, observable — will have infrastructure that scales with the business. Those who defer it will find themselves operating at fleet scale with single-cluster tooling and feeling the gap on every release.

This is something our team has been thinking about for a long time. When we built Argo CD, we could already see the shape of the problem that would come next — and Kargo was the answer to that. After building Kargo as the promotion layer for Kubernetes, VMs, Terraform and beyond, Fleet Promotion Management is the next logical step. Learn more at akuity.io.

Want to see continuous promotion in practice?
Get hands-on with Kargo in about 20 minutes. Try the Quickstart.