Rancher — K8s SRE Reference

TL;DR

Rancher is a centralized control tower for provisioning and operating many Kubernetes clusters. Use it for SSO-backed UI/API, templated downstream clusters (RKE, RKE2, hosted like EKS, or imports), delegated RBAC, and GitOps rollout via Fleet. Keep separation between Rancher-management plane upgrades and downstream cluster lifecycles; pair infrastructure roots with IaC (Terraform) where auditors expect reproducibility.

Importing Downstream Clusters

Imported clusters receive Rancher agents; the management plane proxies auth and orchestrates workloads while Kubernetes remains the reconciliation engine.

bashverify-import-health.sh

# On downstream cluster namespace cattle-system — agents must be Ready.
kubectl -n cattle-system get pods
kubectl get clusterrolebindings | grep cattle-

# Connectivity back to Rancher URL must tolerate corporate proxies/cert trust stores.

RBAC Layers

Rancher identity lives above Kubernetes RBAC — Global / Cluster / Project role bindings map SSO groups to verbs on resources. Understand both layers or you risk “looks allowed in kubectl but blocked in Rancher Projects” paradoxes.

Scope	Governs…	Operational tips
Global	Which clusters/downstreams a principal may see/register	Prefer IdP groups; avoid one-off bindings per user.
Cluster	Cluster-level CRDs, nodes, addons	Align naming with infra team personas (platform vs workload).
Project/Namespace bundles	Delegated namespace collections for app teams	Set resource quotas/network policies upstream in Git when GitOps overlays Rancher abstraction.

yamlrancher-vs-kube-rbac.yaml

# Principle: mirrored Kubernetes Roles still apply — Rancher overlays convenience.
#
# Debugging path:
#   1) Confirm principal's Rancher bindings (global/cluster/project).
#   2) Inspect Kubernetes Roles/ClusterRoles seeded by Rancher.
#   3) Compare impersonation headers when using Rancher shell vs direct kubeconfig.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-binding  # illustrative only

Compare with canonical Kubernetes primitives in Namespaces & RBAC.

Fleet & GitOps Placement

Fleet aggregates bundles per cluster selectors and merges Git-derived manifests downstream. Prefer Fleet when teams already consolidate Git manifests but need Rancher tenancy boundaries; alternatively wire GitOps Patterns / ArgoCD directly on each cluster.

Pattern	Pros	Watch-outs
Fleet GitRepo → Bundle → ClusterGroup	Single pane multi-cluster rollout, pause windows	Error surfacing aggregates — debug per-cluster bundle status endpoints.
Hybrid (Fleet infra + cluster-local ArgoCD)	Aligns autonomy with mandated policies	Establish ownership boundaries to avoid duplicated controllers.
Rancher templates + Terraform	Auditable infra + repeatable cluster births	Template drift occurs if UI tweaks skip Git.

yamlfleet-bundle-shape.yaml

# Illustrative Fleet GitRepo grouping — tune labels to your topology.
targets:
  - name: prod-eks-west
    clusterSelector:
      matchLabels:
        env: prod
        region: us-west-2
    defaultNamespace: platform-addons

Fleet GitRepo Object (Illustrative)

Apply from management cluster referencing credentials Secret; reconcile status via kubectl describe gitrepo patterns in Fleet debugging.

yamlfleet-gitrepo-illustrative.yaml

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-addons
  namespace: fleet-local
spec:
  repo: https://github.example.com/org/k8s-addons.git
  branch: main
  clientSecretName: gh-fleet-ro
  paths:
    - overlays/prod/cluster-addons
  targets:
    - name: eks-prod-multi
      clusterSelector:
        matchExpressions:
          - key: provisioning.cattle.io/labels/topology/cluster-tier
            operator: In
            values:
              - platinum
              - gold
      paused: false
      rolloutStrategy:
        maxUnavailable: "15%"
        maxUnavailablePartitions: "0"
      defaultNamespace: platform-addons
      ignorePartitionsInRolloutCalculation: false
  correctDrift:
    enabled: true
forceSyncGeneration: null

Validate webhook-based Git providers rate limits concurrent Fleet polling jobs versus Argo webhook architecture trade-offs spelled out upstream in vendor docs referenced by GitOps Patterns.

HA & Upgrade Discipline

Concern	Mitigation
Management cluster outage	Deploy Rancher atop HA Kubernetes (typically RKE2), external datastore backup, LB health probes.
Certificates / ingress	Use consistent TLS chain across Rancher & Agents; rotations need planned agent restarts.
Rancher → downstream skew	Coordinate chart upgrades with Fleet/Rancher release notes.
Air-gapped	Private registries mirror + offline Helm assets (more ops-heavy).

Versus Vanilla GitOps Tools

Rancher accelerates onboarding of many clusters owned by heterogeneous teams — not a wholesale replacement for Terraform bases or observability pipelines. Decide whether Rancher is the SSO/RBAC/policy edge or simply a cockpit while Git tooling remains source-of-truth for workloads (GitOps Patterns).

Cluster Patterns Rancher Ships

Pattern	Rancher focus	Operational notes
Imported clusters	Bring existing apiserver kubeconfig secrets	Fewest infra moving parts initially; validates agent networking before wider rollout.
Rancher provisioned RKE/RKE2	Control-plane lifecycle via Rancher API	Treat node templates & machine pools like code—mirror into Terraform/GitOps.
Hosted downstream (AKS/EKS/GKE)	Thin wrapper using cloud credential objects	You still reconcile cloud IAM/security groups externally (EKS Deep Dive, IAM).

Provisioning API Resources (Survey)

Useful inventory after upgrades when webhooks collide—adapt API versions to installed Rancher release.

bashrancher-cr-survey.sh

#!/usr/bin/env bash
set -euo pipefail
# Narrow greps illustrative — extend as your distro ships more CRDs.
kubectl api-resources | grep cattle | awk '{print $1 "\t" $NF}' | column -t

echo "=== clusters.management.cattle.io"
kubectl get clusters.management.cattle.io -A 2>/dev/null || echo "CRD missing or empty"

echo "=== clusters.provisioning.cattle.io"
kubectl get clusters.provisioning.cattle.io -A 2>/dev/null || echo "none"

echo "=== clusterrepos"
kubectl get clusterrepos.catalog.cattle.io -n cattle-system || true

echo "=== machinedeployments (downstream provisioning)"
kubectl get machinedeployment -A 2>/dev/null | head || echo "Machines absent on imported downstream"

echo "=== fleet workspaces"
kubectl get namespaces | grep '^fleet-' || true

echo "Tip: correlate CR age with Helm release history:"
helm list -n cattle-system --max 10 || true

Fleet Delivery Debugging

Fleet persists GitRepo CRs in fleet-local namespaces; reconcile errors commonly stem from malformed target selectors or repos lacking reachable credentials from the management plane.

bashfleet-status.sh

kubectl get gitrepos -n fleet-local
kubectl get bundles -A
kubectl describe gitrepo GITREPO_NAME -n fleet-local

kubectl logs -n cattle-fleet-system deploy/fleet-controller --tail=200

💡

Git drift Use the same branching strategy as broader GitOps Patterns—Fleet pause flags are incident tools, not long-lived deployment freezes.

Tenant Isolation Thoughts

Combine Rancher Projects with Kubernetes NetworkPolicies and optional OPA Gatekeeper constraints. Document when teams may import Helm charts interacting with LoadBalancer/Ingress semantics so centralized networking teams see predictable annotation usage.

Control	Pros	Operational debt
Namespaces per workload + Project quotas	Rapid onboarding with guardrails	Growth of idle namespaces without automation.
Cluster segmentation per criticality tier	Cleaner blast isolation	Operational overhead multiplying agents & upgrades.
Fleet ClusterGroups per region	Regional Git bundles	Debugging requires hopping multiple bundle statuses.

Observability & DR Hooks

Expose Rancher + downstream metrics through federated Prometheus or cloud-native watchers. Backups should cover etcd for management clusters alongside Rancher’s app CRs; validate restores include Fleet GitRepo credential secrets and TLS material.

bashagent-throughput-shape.sh

# Watch cattle agents for repeated reconnect loops (proxy/cert expiry).
kubectl -n kube-system logs -l app=cattle-cluster-agent --tail=100 -f --max-log-requests=5

Authentication Hardening

Concern	Guidance
Rancher local accounts	Disable for production; SSO only plus break-glass vault entries.
Privileged shell access	Throttle via Rancher roles + SSO step-up with ticket correlation.
Agent tokens leakage	Rotate when engineers leave infra teams — stored as cluster secrets downstream.
Admission hooks	Align Kyverno/OPA policies with Fleet ordering to avoid deadlock during bundle apply.

Pair with foundational Kubernetes RBAC reading in Namespaces & RBAC; Rancher overlays should never become the undocumented source of privilege truth.

Rancher-Specific Incident Triage

Symptom	Hypothesis funnel	Evidence
Imported cluster “Provisioning” indefinitely	APIServer rejects agent deploy ServiceAccount RBAC bootstrap	APIServer audit logs filtered for cattle-system denies.
Duplicate secrets / helm releases	Fleet + manual Helm colliding namespaces	`helm list -A` vs Fleet bundle manifests.
Webhook latency regressions cluster-wide after Rancher upgrade	Stale CRDs or validating configs	`kubectl get validatingwebhookconfiguration` compare before/after.
Users see empty cluster picker	Stale global role bindings referencing removed IdP group IDs	Rancher audit export + SSO provider group mapping checklist.

Downstream infra still aligns with classical cluster operations like On-Prem Hosting when kubeadm-hosted — Rancher just centralizes ergonomics.

Helm Bootstrap (Management Cluster)

Pin chart + app versions; externalize ingress hostname TLS secrets; dedicate cattle-system quotas to avoid starving agents during Fleet spikes.

yamlrancher-values-fragment.yaml

hostname: rancher.prod.example.com
replicas: 3
ingress:
  ingressClassName: nginx
  tls:
    source: secret
resources:
  requests:
    cpu: 750m
    memory: 1Gi
  limits:
    memory: 2Gi
useBundledSystemChart: true
rancherImage: rancher/rancher
rancherImageTag: v2.9.x-pinned
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - rancher
        topologyKey: kubernetes.io/hostname
telemetry:
  opt: out
auditLog:
  level: 1
   destination: volume
additionalAnnotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

RKE2 Hard Profiles (Brief)

CIS-aligned profiles tighten kube-apiserver flags and syscall defaults—expect breaking changes for permissive Helm charts unless you annotate exceptions consciously. Imported EKS clusters defer CIS enforcement to AWS shared responsibility nuances instead.

Signal	Remediation idea
Admission failures after profile enable	Patch workloads with compliant seccomp/apparmor classes.
Fluent Bit hostPath blocked	Deploy logging agents via supported DaemonSet manifests per vendor guidance.
Stale PodSecurity labels	Ensure Fleet bundles order PSA namespace labels before workloads.

Backup & Restore Shape

bashrancher-dr-checklist.sh

#!/usr/bin/env bash
set -euo pipefail
# Outline only — orchestrate Velero/etcd snapshots consistent with organizational RTO/RPO SLAs.

RANCHER_NS=cattle-system
BACKUP_BUCKET=s3://compliance-dr-rancher

echo "== etcd snapshot coordination (mgmt cluster)"
kubectl get etcdsnapshotfile -A 2>/dev/null || echo "etcd operator CRDs absent — follow cluster-specific backup tooling"

echo "== export GitRepo credential secrets fingerprints"
kubectl get secrets -n fleet-local -o name | while read -r s; do
  kubectl get "$s" -n fleet-local -o yaml | openssl dgst -sha256
done | tee "${TMPDIR:-/tmp}/fleet-secret-manifest.sha256.txt"

echo "== verify downstream registration tokens TTL policy documented"
kubectl get secrets -n cattle-global-data -o custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp'

echo "upload artifacts"
aws s3 cp "${TMPDIR:-/tmp}/fleet-secret-manifest.sha256.txt" "$BACKUP_BUCKET/daily/" || true

Pair scripted exports with tabletop exercises validating GitOps restores so Fleet definitions resync cleanly post DR.

Day-2 Runbook Bullets

Quarterly validate Rancher SSO group ↔ Global Role binding inventories.
Rotate downstream cluster registration tokens whenever platform teams rotate.
Snapshot Fleet GitRepos before mass pin bumps on chart versions.
Align maintenance windows upstream with downstream EKS platform upgrades.
Correlate Service annotation standards across clusters injected via Fleet overlays.
Track cattle-node-agent DaemonSet rollout status post OS kernel upgrades.
Keep disaster recovery rehearsals covering management cluster etcd AND downstream registration secrets.
Document when teams may bypass Rancher Projects for emergencies—paired with retrospective Git commits.
Monitor Fleet bundle NotReady durations as SLI for GitOps regressions complementing repo CI.
Runbooks should link BOTH Rancher impersonation kubectl AND direct apiserver kubeconfig paths.
Version compatibility matrix pinned next to Helm values in Terraform rendered docs optionally.
Incident bridges: mute noisy agent reconnect warnings only after distinguishing cert vs proxy outages.
Capacity plan management cluster etcd growth from audit logging retention choices.
Cost allocate shared Rancher infra via FinOps tagging mirrored in AWS/GCP/Azure billing exports.
Validate Pod Security Admission defaults per cluster persona before onboarding high-risk workloads.
Establish maximum Rancher-generated shell session duration alerting for compliance auditors.

Fleet & Management Capacity Planning

Count GitRepos × bundle fan-out clusters to estimate Fleet controller reconcile QPS ceilings.
Model cattle-cluster-agent cardinality against Rancher websocket gateway limits documented per release.
Watch etcd database size trending on management cluster before CR counts explode unmanaged.
Plan horizontal scaling thresholds for Fleet controller Pods mirroring Prometheus queue lag metrics.
Align Rancher Ingress bandwidth with SSO IdP SAML/OIDC exchange spikes during simultaneous logins.
Capacity test downstream import storms after acquisitions—validate agent DaemonSet rollout budgets.
Correlate LoadBalancer quotas per downstream cloud impacting exposed Rancher-derived services.
Plan secret backend (Vault/AWS SM) concurrency when Fleet hydrates tens of namespaces simultaneously.
Review Git provider rate-limit budgets shared between Argo forks and Fleet polling intervals.
Schedule chaos drills terminating single management-plane node while agents reconnect.
Document maximum downstream clusters supported per SSO binding before latency complaints surface.
Validate CPU reservations for auditing sidecars scraping Rancher APIs.
Align backup windows with etcd snapshot durations after CR sprawl milestones.
Benchmark Helm upgrade durations for Rancher chart transitions across staging replicas.
Track CRD conversion webhook counts when mixing preview Rancher RC builds with stable downstream.
Coordinate FinOps dashboards linking cluster inventory exports to tagging standards per Terraform outputs.
Schedule quarterly Fleet GitRepo pruning to avoid orphaned bundle reconcilers.
Capacity-gate exploratory UI experiments in sandbox Rancher clones before prod toggles.
Govern screenshot-heavy training materials separately from apiserver-heavy automation tests.
Archive historical cluster registration metadata for compliance timelines tied to mergers.

Gotchas

!Agents offline often means egress/proxy/cert issues — correlate cattle-cluster-agent logs with Rancher server ingress.
!Overlapping Fleet + Argo drift when both reconcile the same namespaces.
!Project quotas without Git backing get overwritten accidentally during Helm experiments.
!Privileged debugging via Rancher shell may violate compliance — gate with SSO + break-glass.
!Imported EKS clusters inherit aws-auth quirks — reconcile Rancher impersonation mappings after IAM SSO changes.
!Version hopping management server without staged downstream upgrades risks CRD webhook incompatibility windows.

Importing Downstream Clusters

RBAC Layers

Fleet & GitOps Placement

Fleet GitRepo Object (Illustrative)

HA & Upgrade Discipline

Versus Vanilla GitOps Tools

Cluster Patterns Rancher Ships

Provisioning API Resources (Survey)

Fleet Delivery Debugging

Tenant Isolation Thoughts

Observability & DR Hooks

Authentication Hardening

Rancher-Specific Incident Triage

Helm Bootstrap (Management Cluster)

RKE2 Hard Profiles (Brief)

Backup & Restore Shape

Day-2 Runbook Bullets

Fleet & Management Capacity Planning

Gotchas

Related Pages