
Kubernetes in Production: The Best Practices That Actually Matter

Move beyond tutorials and learn the Kubernetes production patterns that keep clusters reliable at scale. Covers resource management, security hardening, observability, GitOps, and disaster recovery.

March 18, 2026 · 9 min read · By CloudaQube Team

[Image: Kubernetes production cluster architecture with monitoring and security layers]

The Gap Between Tutorials and Production

Deploying an Nginx pod to a local Kubernetes cluster takes five minutes. Running a production Kubernetes platform that handles real traffic, real failures, and real security threats is an entirely different discipline.

Kubernetes is the number one searched DevOps topic on Pluralsight, with over 4 million tutorial views on YouTube. But here's the problem: most content stops at kubectl apply and never addresses what happens when your cluster is serving 10,000 requests per second at 3 AM and a node goes down.

This guide covers the production patterns that separate demo clusters from reliable infrastructure. Every recommendation here comes from real operational experience — the kind of knowledge that takes years of on-call rotations to accumulate.

Resource Management: The First Thing That Breaks

Resource misconfiguration is the number one cause of Kubernetes production incidents. Either pods don't get enough resources and degrade under load, or they get too much and starve other workloads.

Always Set Resource Requests

Every container in every pod should have resource requests defined. Requests tell the scheduler how much CPU and memory the container needs, and the scheduler uses this to place pods on nodes with sufficient capacity.

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    memory: "512Mi"

Set memory limits. Be careful with CPU limits. Memory limits prevent a single misbehaving container from OOM-killing everything on the node. CPU limits are more controversial — they can cause throttling even when the node has spare CPU capacity. Many production teams set CPU requests but omit CPU limits, relying on the scheduler for fair distribution.

Use LimitRanges and ResourceQuotas

In multi-tenant clusters, enforce guardrails at the namespace level:

  • LimitRanges set default requests/limits for containers that don't specify them and cap maximum resource claims per container.
  • ResourceQuotas cap total resource consumption per namespace so one team can't monopolize the cluster.
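A minimal sketch of both guardrails for a namespace (the quota numbers are illustrative — size them to your team's actual footprint):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: "100m"
      memory: "128Mi"
    default:               # applied when a container omits limits
      memory: "256Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"     # total CPU requests across the namespace
    requests.memory: 40Gi
    limits.memory: 80Gi
```

With these in place, a pod that omits requests still gets sensible defaults instead of falling into BestEffort.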

The Pod Without Requests

A pod whose containers set no requests or limits falls into the BestEffort QoS class. It's the first to be evicted when a node runs low on resources. In production, this means your most important workloads can be killed because someone forgot a few lines of YAML. Use LimitRanges to set defaults and never let a pod run without requests.

Horizontal Pod Autoscaling

Static replica counts don't survive traffic spikes. Configure HPA for any workload with variable demand:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Set minReplicas to at least 3 for high-availability workloads: three replicas let you lose a pod to a node failure while a rolling update temporarily takes another out of service, and still keep serving traffic. Scale based on CPU or custom metrics depending on your workload.

Security Hardening

A default Kubernetes cluster is not secure. The defaults prioritize ease of use over security. Production clusters need deliberate hardening.

Pod Security Standards

Enforce Pod Security Standards at the namespace level to prevent pods from running with dangerous privileges:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

The restricted profile prevents running as root, disallows privilege escalation, requires dropping all Linux capabilities, mandates a seccomp profile, and blocks host namespaces and privileged containers. Start with baseline if restricted is too disruptive for your existing workloads.

Network Policies

By default, every pod can talk to every other pod. In production, implement deny-by-default network policies and explicitly allow only the traffic your application needs:

# Default deny all ingress in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress

Then add specific policies to allow legitimate traffic paths. This limits the blast radius of a compromised pod.
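For example, an allow policy for one legitimate path might look like this (the app labels and port are placeholders — substitute your own):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api          # the pods receiving traffic
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend  # the only pods allowed to connect
    ports:
    - protocol: TCP
      port: 8080
```

Combined with the default-deny policy above, only frontend pods can reach the API pods, and only on port 8080.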

RBAC Best Practices

  • Never use cluster-admin for workloads. Create namespace-scoped Roles with the minimum necessary permissions.
  • Use ServiceAccounts per workload, not the default ServiceAccount. Disable automounting of ServiceAccount tokens for pods that don't need API access.
  • Audit RBAC regularly. Use kubectl auth can-i --list --as=system:serviceaccount:ns:sa to verify actual permissions.
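A namespace-scoped least-privilege pair might look like this sketch (names and the granted resources are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-config-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]   # read-only, nothing more
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-config-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: api              # dedicated ServiceAccount for this workload
  namespace: production
roleRef:
  kind: Role
  name: api-config-reader
  apiGroup: rbac.authorization.k8s.io
```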

Image Security

Only pull images from trusted registries. Use image digests (image: nginx@sha256:...) instead of mutable tags in production. Run an admission controller like Kyverno or OPA Gatekeeper to enforce image policies cluster-wide. A compromised image tag is one of the easiest attack vectors in Kubernetes.
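As a hedged sketch, a Kyverno ClusterPolicy enforcing digest pinning could look roughly like this (policy name and message are illustrative; check the Kyverno docs for the version you run):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce   # reject non-compliant pods at admission
  rules:
  - name: require-digest
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Container images must be pinned by digest (image@sha256:...)."
      pattern:
        spec:
          containers:
          - image: "*@sha256:*"
```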

Observability: You Can't Fix What You Can't See

Production Kubernetes requires three pillars of observability: metrics, logs, and traces.

Metrics with Prometheus

Prometheus is the de facto standard for Kubernetes metrics. Deploy it with the kube-prometheus-stack Helm chart, which includes Prometheus, Grafana, and pre-built dashboards for cluster and workload monitoring.

Key metrics to alert on:

  • Node: CPU utilization > 80%, memory utilization > 85%, disk pressure, not-ready status
  • Pod: Restart count increasing, OOMKilled events, pending pods > 5 minutes
  • Application: Request latency p99, error rate, request rate (the RED method)
  • Cluster: API server latency, etcd leader changes, scheduler failures
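With the kube-prometheus-stack, alerts like these are expressed as PrometheusRule resources. A sketch for the pod-restart alert, assuming kube-state-metrics metric names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts
spec:
  groups:
  - name: pods
    rules:
    - alert: PodRestartingFrequently
      # more than 3 restarts in the last 15 minutes
      expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently"
```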

For a complete walkthrough of setting up Prometheus and Grafana on Kubernetes, see our Kubernetes monitoring guide.

Structured Logging

Ensure all applications log in JSON format. Deploy a log aggregation stack (Loki, EFK, or a managed service) to centralize logs across all pods. Key practices:

  • Include correlation IDs in every log line for distributed tracing.
  • Set log levels appropriately — INFO in production, DEBUG only when troubleshooting.
  • Don't log to files inside containers. Write to stdout/stderr and let the Kubernetes logging pipeline handle collection.
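Put together, a structured log line written to stdout might look like this (field names are illustrative — pick a schema and keep it consistent across services):

```json
{"ts":"2026-03-18T03:12:45Z","level":"info","service":"api","trace_id":"4bf92f3577b34da6","msg":"order created","pod":"api-7d9f8-xk2lp"}
```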

Distributed Tracing

For microservice architectures, deploy OpenTelemetry collectors to capture distributed traces. Traces show you the full request path across services and pinpoint exactly where latency or failures originate.

GitOps: Declarative Cluster Management

Managing production clusters with kubectl apply from a laptop doesn't scale. GitOps uses Git as the single source of truth for cluster state.

How GitOps Works

  1. All Kubernetes manifests live in a Git repository.
  2. A GitOps operator (Argo CD or Flux) watches the repository.
  3. When manifests change in Git, the operator applies the changes to the cluster.
  4. The operator continuously reconciles cluster state with the Git repository.
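With Argo CD, steps 2–4 are captured in a single Application resource. A sketch (the repo URL and paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo   # placeholder repo
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc            # the local cluster
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to Git state
```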

Why GitOps Matters for Production

  • Audit trail: Every change is a Git commit with an author, timestamp, and description.
  • Rollback: Reverting a bad deployment is git revert. No need to remember what the previous state looked like.
  • Consistency: No manual kubectl commands that create drift between what's in Git and what's running.
  • Access control: Developers submit pull requests. Only the GitOps operator has cluster write access.

Argo CD vs. Flux

Both are mature, CNCF-graduated projects. Argo CD has a web UI for visualization and is slightly easier to get started with. Flux is more modular and integrates tightly with Helm and Kustomize. Pick either — the GitOps pattern matters more than the tool choice.

High Availability and Disaster Recovery

Control Plane HA

  • Run at least 3 control plane nodes across different availability zones.
  • Use an external etcd cluster or ensure etcd runs on all control plane nodes with proper backup schedules.
  • Place an internal load balancer in front of the API servers.

Application-Level HA

  • Run at least 3 replicas for critical workloads.
  • Use pod anti-affinity to spread replicas across nodes and AZs.
  • Configure Pod Disruption Budgets to prevent too many pods from being evicted simultaneously during node maintenance.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
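The anti-affinity recommendation above can be sketched as a snippet inside the Deployment's pod template (the app label is assumed to match your replicas):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: api
        # spread replicas across availability zones;
        # use kubernetes.io/hostname to spread across nodes instead
        topologyKey: topology.kubernetes.io/zone
```

Preferred (soft) anti-affinity still schedules pods when zones are full; use requiredDuringSchedulingIgnoredDuringExecution only if co-location is never acceptable.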

etcd Backup Strategy

Back up etcd automatically on a schedule. Store backups outside the cluster (S3, GCS, Azure Blob). Test restoration quarterly. An etcd failure without a backup means rebuilding the entire cluster from scratch.
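A manual snapshot looks roughly like this — run it on a control plane node; the endpoint and certificate paths shown are typical kubeadm defaults and may differ in your cluster:

```shell
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

In practice, wrap this in a CronJob or your backup tooling and ship the snapshot file to object storage.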

Upgrade Strategy

Kubernetes releases a new minor version every four months, and each version is supported for 14 months. Staying current is non-negotiable — running unsupported versions means no security patches.

Upgrade Safely

  1. Read the changelog. Every release has deprecations and breaking changes.
  2. Upgrade non-production first. Validate your workloads work on the new version before touching production.
  3. Upgrade one minor version at a time. Never skip versions (e.g., 1.28 → 1.30).
  4. Use node pool rolling updates. Drain and replace nodes one at a time to maintain availability.
  5. Run admission webhook dry-runs to catch manifest incompatibilities before applying.

Production Readiness Checklist

Before going to production, verify every item:

  • Resource requests and limits set on all containers
  • LimitRanges and ResourceQuotas configured per namespace
  • HPA configured for variable workloads
  • Pod Security Standards enforced
  • Network policies in place (default deny + explicit allows)
  • RBAC configured with least privilege
  • Images pulled from trusted registries with digest pinning
  • Prometheus metrics and alerting configured
  • Centralized logging deployed
  • GitOps operator managing deployments
  • etcd backup schedule configured and tested
  • Pod Disruption Budgets set for critical workloads
  • Pod anti-affinity spreading replicas across AZs
  • Ingress with TLS termination configured
  • Secrets encrypted at rest in etcd

Conclusion

Running Kubernetes in production is an ongoing discipline, not a one-time setup. The patterns in this guide — resource management, security hardening, observability, GitOps, and HA — form the foundation that everything else builds on.

Start with resource requests and security hardening. Add observability so you can see what's happening. Implement GitOps to manage change safely. Then refine your HA and DR strategy based on your specific availability requirements.

The teams that run Kubernetes successfully treat it as a platform, not just a deployment target. They invest in tooling, automation, and operational practices that make the platform reliable for every team that deploys to it. If you're preparing to validate your Kubernetes skills with a certification, our CKA study guide covers the exam-specific knowledge that complements these production practices.

Want to practice this hands-on?

CloudaQube generates complete labs from a simple description. Try it free.

Get Started Free

CloudaQube Team

Cloud Infrastructure Engineers

Level up your cloud skills

Get hands-on with AI-generated labs tailored to your skill level. Practice AWS, Azure, Kubernetes, and more.

Start Learning Free