Kubernetes Cluster Lifecycle Management (KCLM)
This guide covers the deployment of a Kubernetes Cluster Lifecycle Management (KCLM) solution to use metal-stack as a cloud provider. metal-stack supports three KCLM solutions:
KCLM Solutions Overview
Gardener (Recommended)
Gardener is the recommended KCLM solution for metal-stack. It has been battle-tested in production for over seven years at financial-sector customers and natively bundles more day-2 capabilities (DNS, backup, audit) than the alternatives described below. Gardener manages entire clusters as Kubernetes-native resources with a strong separation between platform operators and end-users.
We recommend using a dedicated cluster for Gardener, separate from the metal-stack initial cluster. While it is technically possible to deploy both metal-stack and Gardener on the same initial cluster, dedicated clusters provide better isolation, clearer operational boundaries, and align with production best practices for critical infrastructure. For guidance on setting up the initial cluster, see the Initial Cluster documentation.
For more details on Gardener terminology, architecture, operational model, failure domains, and operational features, see the Gardener concept doc.
Deployment Summary
Gardener can be deployed with the gardener-* Ansible roles.
The following data center infrastructure dependencies are treated as given and must be available before deploying Gardener:
- DNS — For cluster domain resolution
- NTP — Time synchronization across all nodes
- ACME — Certificate authority (for shoot certificates via shoot-cert-service)
- S3-compatible object storage — For etcd backups (gardener-extension-backup-s3)
- Git-Hosting with CI/CD — You must set up your own Git repository and CI/CD pipeline to manage cluster deployments (see Fleet Management and GitOps below).
The following dependencies are introduced:
- CNI: Calico or Cilium
- MetalLB for exposing the Kubernetes API servers of the clusters
In summary, this results in the following cluster hierarchy:
- Garden cluster: The Gardener control plane (Gardener API server, controller manager, scheduler, admission controller) deployed on a dedicated cluster.
- Seed: A cluster running the `gardenlet` agent, connected to the Gardener control plane. Seeds are deployed inside the metal-stack partition and orchestrate cluster provisioning within that site.
- Shoot: Every fully provisioned and managed Kubernetes cluster. Shoot control planes run as pods in the Seed namespace, while worker nodes are provisioned as bare-metal machines on metal-stack infrastructure.
metal-stack is officially supported by the Gardener dashboard, which helps you manage Shoots, Seeds, and Projects through a web UI.
Core Controllers
The Gardener platform consists of the following core controllers, all deployed via the Ansible roles:
| Component | Responsibility |
|---|---|
| `gardener-operator` | Deploys Gardener components, gardenlets, and extensions; manages platform updates |
| `gardener-apiserver` | Extends the kube-apiserver with Gardener-specific resources (Shoot, Seed, Project, etc.) |
| `gardener-scheduler` | Decides where clusters are placed across the Gardener landscape (Seeds) |
| `gardener-controller-manager` | Reconciles common Gardener resources (projects, controller installations, etc.) |
| `gardenlet` | Agent running on each Seed; orchestrates provisioning of new clusters within that Seed |
| `gardener-resource-manager` | Runs inside Shoots; reconciles desired resources and checks their health |
| `etcd-druid` | etcd cluster operator with built-in backup-restore functionality |
| `machine-controller-manager` | Manages worker node lifecycle (rolling updates, health recreation, scaling) |
| `machine-controller-manager-provider-metal` | Integrates metal-stack machine provisioning API with Gardener's MCM |
Gardener Extensions
Gardener's extensibility model allows provider-specific reconcilers to be deployed during cluster provisioning. The gardener-extensions Ansible role deploys the following extensions into the Gardener runtime cluster:
| Extension | Purpose |
|---|---|
| `gardener-extension-provider-metal` | IaaS integration: reconciles Infrastructure, ControlPlane, and Worker resources via the metal-stack API |
| `os-metal-extension` | Translates Gardener's generic OperatingSystemConfig into cloud-init/ignition userdata |
| `gardener-extension-networking-calico` | Calico CNI extension |
| `gardener-extension-networking-cilium` | Cilium CNI extension (alternative to Calico) |
| `gardener-extension-dns-powerdns` | DNS management via PowerDNS |
| `shoot-dns-service` | DNS service for Shoot clusters |
| `gardener-extension-backup-s3` | etcd backup to S3-compatible object storage |
| `gardener-extension-audit` | Audit logging webhook |
| `gardener-extension-acl` | Access control list management |
| `shoot-cert-service` | Certificate management with Let's Encrypt (supports Shoot-level issuers) |
| `gardener-extension-csi-driver-lvm` | LVM-based CSI driver for local storage |
| `gardener-extension-ontap` | NetApp ONTAP CSI driver |
Most extensions are enabled/disabled via Ansible variables (e.g., gardener_extension_provider_metal_enabled). Key configuration variables for the metal provider include:
- `gardener_extension_provider_metal_etcd_storage_class_name`: Storage class for Shoot etcds
- `gardener_extension_provider_metal_etcd_backup_schedule`: etcd backup schedule
- `gardener_extension_provider_metal_machine_images`: Machine images (typically matches CloudProfile)
- `gardener_extension_provider_metal_admission_default_pods_cidr`: Default pod CIDR for Shoots
- `gardener_extension_provider_metal_admission_default_services_cidr`: Default services CIDR for Shoots
For the full variable reference, see the gardener-extensions README.
Fleet Management and GitOps
You must set up your own Git repository and CI/CD pipeline to manage cluster deployments. This gives you peer review, audit trails, and rollback capabilities.
What you need to build:
- Git repository: Store the following as YAML manifests:
  - `cloudprofiles/`: CloudProfile definitions (whitelisted regions, machine types, OS images, Kubernetes versions)
  - `seeds/`: Seed configurations per data center
  - `projects/<name>/shoots/`: Per-project Shoot manifests
  - `extensions/`: Helm charts for Gardener extensions
- CI/CD pipeline: Deploy manifests from Git to the Gardener API (Virtual Garden). This pipeline is your primary interface for fleet-wide changes.
- Branching strategy: Use separate branches or environments (staging → production) to validate changes before rolling them out fleet-wide.
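As a rough illustration of what a pipeline step over this repository layout could do, the following Python sketch selects the YAML manifests and sorts them into an apply order. The order itself (CloudProfiles, then Seeds, then project Shoots) is an assumption based on Shoots referencing the other resources, and the function name `plan_apply` and the sample paths are hypothetical. The `extensions/` directory is skipped because it holds Helm charts rather than plain manifests.

```python
from pathlib import PurePosixPath

# Assumed apply order (not prescribed by Gardener or metal-stack):
# resources that Shoots depend on are applied first.
APPLY_ORDER = ("cloudprofiles", "seeds", "projects")


def plan_apply(paths):
    """Return the YAML manifests under the known top-level directories,
    sorted by APPLY_ORDER and then alphabetically by path."""
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix not in (".yaml", ".yml"):
            continue  # not a manifest
        top = path.parts[0] if path.parts else ""
        if top in APPLY_ORDER:
            selected.append((APPLY_ORDER.index(top), raw))
    return [raw for _, raw in sorted(selected)]


# Hypothetical repository contents.
repo_files = [
    "projects/team-a/shoots/dev.yaml",
    "README.md",
    "seeds/dc1.yaml",
    "extensions/provider-metal/Chart.yaml",
    "cloudprofiles/metal.yaml",
]
for manifest in plan_apply(repo_files):
    print(manifest)
```

How the pipeline then submits each manifest to the Virtual Garden (for example with `kubectl apply`) is left open here.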
Operational capabilities provided by Gardener:
Once your GitOps pipeline is in place, Gardener provides the following day-2 operational features:
- CloudProfile validation: Administrators define allowed regions, machine types, operating systems, and Kubernetes versions. Shoot specs are validated against the CloudProfile before being stored in the Virtual Garden's etcd.
- Multi-stage environments: End-users can label clusters as `evaluation` or `development` to test upcoming Kubernetes versions and auto-upgrades before rolling them out to `production` clusters.
- Maintenance time windows: Configurable per Shoot; all day-2 operations (Kubernetes patch updates, machine image updates) are carried out within these windows.
- Emergency patching: Administrators can apply fleet-wide changes via image vector overwrites in the Gardener deployment Git repository. Changes must be validated in a dedicated staging environment first.
- Accidental deletion protection: Shoot deletion is guarded by specific annotations. etcd backup retention timeouts are configurable, allowing cluster restoration after accidental deletion.
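The CloudProfile validation described in the list above boils down to checking each requested value against an allowed set. The following Python sketch shows that idea only. The flat dictionaries, the field names, and the sample values are hypothetical simplifications and do not mirror the real CloudProfile or Shoot schemas; Gardener performs the actual validation itself.

```python
# Hypothetical mapping from a simplified Shoot field to the
# CloudProfile key that lists its allowed values.
CHECKS = {
    "region": "regions",
    "machine_type": "machine_types",
    "os_image": "os_images",
    "kubernetes_version": "kubernetes_versions",
}


def validate_shoot(shoot, profile):
    """Return a list of violations; an empty list means the spec is allowed."""
    errors = []
    for field, allowed_key in CHECKS.items():
        value = shoot.get(field)
        if value not in profile.get(allowed_key, ()):
            errors.append(f"{field}={value!r} is not allowed by the CloudProfile")
    return errors


# Illustrative values only; none of these names come from a real landscape.
profile = {
    "regions": {"partition-a"},
    "machine_types": {"size-medium", "size-large"},
    "os_images": {"ubuntu-24.04"},
    "kubernetes_versions": {"1.31.4"},
}
shoot = {
    "region": "partition-a",
    "machine_type": "size-xlarge",
    "os_image": "ubuntu-24.04",
    "kubernetes_version": "1.31.4",
}
print(validate_shoot(shoot, profile))
```

A check of this kind could serve as an early lint step in the GitOps pipeline, so that obviously invalid Shoot manifests are rejected during review rather than at apply time.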
Cluster-API (Alternative)
Cluster-API is a CNCF project maintained by a Kubernetes SIG that provides declarative cluster management through a management cluster. The metal-stack provider (CAPMS) is under development and not yet production-ready.
The cluster-api-provider-metal-stack (CAPMS) infrastructure provider translates CAPI resources into metal-stack API calls for machine, firewall, and IP allocation. CAPMS is tested against the Kubeadm Bootstrap Provider (CABPK) and uses the Add-on Provider for Helm (CAAPH) for installing CNIs like Calico and the metal-ccm.
Cluster-API with metal-stack is in development and not advised for production use. Please use Gardener for production workloads. We are actively looking for exchange and adopters — if you are interested in using Cluster-API with metal-stack, please join our community to help shape future integration efforts.
For more details on Cluster-API concepts, architecture, operational model, and control plane hosting, see the Cluster-API concept doc.
Deployment
Cluster-API with metal-stack is deployed through the cluster-api-provider-metal-stack (CAPMS) infrastructure provider. The full deployment guide is available in the CAPMS README.
Deployment flow
- Prepare management cluster: A Kubernetes cluster to host CAPI controllers and cluster state (e.g., a `kind` cluster)
- Configure `clusterctl`: Add the metal-stack provider URL to `~/.config/cluster-api/clusterctl.yaml`
- Set environment variables: `METAL_API_URL`, `METAL_API_HMAC`, `METAL_PROJECT_ID`, `METAL_PARTITION`, control plane/worker/firewall machine images and sizes, cluster name, namespace, and Kubernetes version
- Install CAPMS: Build and push the CAPMS image, then run `make install` and `make deploy` (or apply the released `infrastructure-components.yaml` directly)
- Initialize clusterctl: Run `clusterctl init --infrastructure metal-stack` on the management cluster
- Generate and apply cluster manifest: Use `clusterctl generate cluster` to produce a YAML with `Cluster`, `MetalStackCluster`, `KubeadmControlPlane`, `MachineDeployment`, and `MetalStackMachine` resources, then apply it
- Deploy add-ons: Install CNI (Calico) and `metal-ccm` into the workload cluster via `ClusterResourceSet` and CAAPH (Cluster API Add-on Provider for Helm)
- Retrieve kubeconfig: Use `clusterctl get kubeconfig` to access the provisioned cluster
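Because several of the steps above depend on environment variables, a small preflight check can catch missing values before `clusterctl` is invoked. The Python sketch below covers only the four variables named explicitly in this guide; the variables for machine images, sizes, cluster name, namespace, and Kubernetes version are not listed by name here, so they are left out rather than guessed. The helper name `missing_vars` is hypothetical.

```python
import os

# Only the variable names stated in this guide; extend the tuple with the
# image, size, and cluster variables documented in the CAPMS README.
REQUIRED_VARS = (
    "METAL_API_URL",
    "METAL_API_HMAC",
    "METAL_PROJECT_ID",
    "METAL_PARTITION",
)


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
sample_env = {"METAL_API_URL": "https://api.example.invalid", "METAL_API_HMAC": ""}
print("missing:", ", ".join(missing_vars(sample_env)))
```

Calling `missing_vars()` without arguments checks the real process environment, which is how a wrapper script around the deployment flow would typically use it.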
Network integration
Network integration for Cluster-API currently requires more manual steps than with Gardener. Node networks must be created manually via metalctl and provided as environment variables. IP addresses for the control plane must also be allocated in advance through metalctl. Firewall rules are currently static and can be applied to firewall nodes; no automatic firewall controller is in place yet. Automatic network resource allocation is on the roadmap for CAPMS.
For service exposure, CAPMS uses KubeVIP in BGP mode to allocate and announce public IPs, similar to the MetalLB-based approach in Gardener.
Air-gapped environments
For air-gapped deployments, follow the Cluster API Operator air-gapped environment guide. All required images must be mirrored to an OCI registry reachable from the management cluster.
Fleet management and GitOps
You must set up your own Git repository and GitOps operator to manage cluster deployments.
What you need to build:
- Git repository: Store cluster manifests generated via `clusterctl generate cluster <cluster-name>`. Each cluster gets its own set of YAML files containing `Cluster`, `MetalStackCluster`, `KubeadmControlPlane`, `MachineDeployment`, and `MetalStackMachine` resources.
- GitOps operator: Deploy ArgoCD or FluxCD to watch your Git repository and apply manifests to the management cluster, ensuring drift-free declarative delivery.
- Per-cluster CI/CD: Essential components (CNI, CCM) are rolled out on a per-cluster basis. Changes to `MachineTemplate` or `ClusterResourceSet` are staged through the Git repository with standard approval processes.
Platform capabilities:
- Cluster migration: `clusterctl move` enables moving workload cluster resources between management clusters, pausing controllers during the move to prevent worker node loss.
Kamaji (Alternative)
Kamaji integrations with metal-stack have not been evaluated in production-grade scenarios. We are actively looking for exchange and adopters — if you are interested in using Kamaji with metal-stack, please join our community to help shape future integration efforts.
Kamaji is a Control Plane Manager for Kubernetes that runs control planes as pods within a management cluster, cutting down on operational overhead and costs. It supports multi-tenancy, high availability, and integrates seamlessly with Cluster API.
Kamaji allows a control plane hosting model similar to Gardener's, where the control plane runs on dedicated infrastructure separate from worker nodes.
Kamaji with metal-stack
Kamaji acts as a ControlPlaneProvider with Cluster API, while CAPMS acts as the InfrastructureProvider. This setup manages tenant clusters on metal-stack infrastructure:
- Deploy Kamaji into a management cluster (e.g., a `kind` cluster) using the Kamaji on kind tutorial
- Install CAPMS as the infrastructure provider into the same management cluster
- Create a control plane VIP: MetalLB assigns a virtual IP in the management cluster's network for the tenant API server
- Create a tenant cluster: Registers the VIP in MetalLB, applies the cluster template via `clusterctl`. Kamaji creates the control plane pods in the management cluster
- Provision infrastructure: CAPMS provisions firewall and worker machines on metal-stack via the metal-stack API
- Join workers: Machines join via CABPK and kubeadm
- Deploy add-ons: Install CNI (Calico) and `metal-ccm` into the tenant cluster
A working showcase is available in the capi-lab setup, which extends the mini-lab with a Kamaji flavor. See our blog post for a detailed walkthrough of the architecture and setup.
Fleet management and GitOps
Since Kamaji with metal-stack uses Cluster-API under the hood (Kamaji as ControlPlaneProvider, CAPMS as InfrastructureProvider), fleet management follows the same pattern as Cluster-API described above. Tenant cluster manifests are generated via clusterctl, stored in Git, and deployed through your CI/CD pipeline.