Custom Operators for Managing Stateful Apps on Kubernetes

TL;DR – Quick Takeaways

StatefulSets can spin up pods, but they stumble when a workload needs coordinated upgrades, backups, or external‑system integration.

Operators embed domain knowledge directly into the Kubernetes control plane, turning complex lifecycle steps into declarative API calls.
CRDs define the desired state; a reconciliation loop continuously drives the cluster toward that state, handling failures idempotently.

Kubebuilder v3.8 and Operator SDK v1.28 both scaffold production‑ready code, but Kubebuilder leans heavily on controller‑runtime while the SDK bundles Helm/Ansible helpers.
Real‑world operators (e.g., etcd‑backup‑operator) showcase patterns for backup/restore, version upgrades, and safe rollout—capabilities you rarely see in “hello‑world” tutorials.

Before you start, you need:

A working Kubernetes cluster (v1.26+ recommended).
kubectl 1.26+, Go 1.21, and Docker 24.
Familiarity with Custom Resource Definitions and basic Go programming.

Access to a container registry (Docker Hub, GitHub Packages, or a private repo).

The Limitations of Native Kubernetes Controllers for Stateful Workloads

When you launch a production database on Kubernetes, the first thing you reach for is usually a StatefulSet. It gives you stable network IDs, ordered pod creation, and a PVC per replica. That sounds perfect—until you need to upgrade the cluster without data loss, perform a coordinated backup, or scale down while preserving quorum.

Where StatefulSets Fall Short: Complex Lifecycles and External Dependencies

StatefulSets handle pod ordering but they lack any notion of application‑level state. Imagine a three‑node Cassandra ring. Scaling from three to five nodes requires:

Adding two new pods in a specific order.
Streaming data from existing nodes to the newcomers.
Rebalancing token ranges without causing over‑replication.

A plain StatefulSet will create the pods, but it won’t kick off the streaming or adjust the ring token map. You end up writing ad‑hoc scripts, tying them to postStart hooks, and hoping they survive node restarts. When a network partition occurs, those scripts can leave the ring in an inconsistent state, and the StatefulSet controller has no way to detect or repair it because it tracks pods, not ring membership.

⚠️ Warning: Relying on lifecycle hooks for stateful coordination often leads to race conditions and hard‑to‑debug state drift.

The Operator Pattern: Extending the Kubernetes API for Your Domain

Operators turn the “external script” problem into a first‑class citizen of the API. By defining a Custom Resource such as CassandraCluster, you hand the user a single YAML object that describes what the desired cluster looks like. The operator’s controller reads that spec, translates it into a series of safe, idempotent actions, and updates the cluster until reality matches the spec.
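To make the "drive reality toward the spec" idea concrete, here is a minimal sketch in plain Go with no Kubernetes dependencies. The `ClusterSpec` and `World` types and the `reconcile` function are hypothetical stand-ins for a custom resource, the live cluster, and a controller's reconcile step; a real operator would read and write these through the API server.

```go
package main

import "fmt"

// ClusterSpec is a hypothetical desired state: how many replicas the user asked for.
type ClusterSpec struct {
	Replicas int
}

// World stands in for the real cluster: the replicas that currently exist.
type World struct {
	Running int
}

// reconcile takes one safe step toward the desired state and reports whether
// another pass is needed. Calling it again after convergence changes nothing.
func reconcile(spec ClusterSpec, w *World) (requeue bool) {
	switch {
	case w.Running < spec.Replicas:
		w.Running++ // add one replica at a time
		return true
	case w.Running > spec.Replicas:
		w.Running-- // remove one replica at a time
		return true
	default:
		return false // reality already matches the spec
	}
}

func main() {
	spec := ClusterSpec{Replicas: 3}
	world := &World{Running: 1}
	for reconcile(spec, world) {
		fmt.Println("running replicas:", world.Running)
	}
	fmt.Println("converged at", world.Running)
}
```

Each pass makes one small change and re-evaluates, so an interruption at any point leaves the system in a state the next pass can pick up from.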

💡 Pro Tip: Think of an operator as a state machine that lives inside the cluster, not a batch job running on a CI server.

Anatomy of a Custom Operator: Breaking Down the Components

Building an operator feels like assembling LEGO bricks. Each brick—CRD, controller, scaffold—has a precise role. Let’s dissect them one by one.

Custom Resource Definitions (CRDs): Modeling Your Application State

A CRD tells the API server what fields you expect. For a database, you might expose replicas, backupSchedule, and version. Here’s a minimal snippet for a PostgresCluster CRD using kubebuilder v3.8:

# postgrescluster_crd.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.example.com
spec:
  group: db.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1
                version:
                  type: string
                  pattern: "^\\d+\\.\\d+\\.\\d+$"
                backupSchedule:
                  type: string
                  pattern: "^[0-9]{2}:[0-9]{2}$"
  scope: Namespaced
  names:
    plural: postgresclusters
    singular: postgrescluster
    kind: PostgresCluster
    shortNames:
      - pgc

⚠️ Warning: Exactly one version in a CRD must set storage: true. If none (or more than one) does, the API server rejects the CRD with a validation error rather than accepting it.
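The schema patterns above are easy to get subtly wrong, so it can help to exercise them locally before applying the CRD. The sketch below copies the two regular expressions into Go; `validateSpec` is a hypothetical helper, not something Kubebuilder generates, and it assumes all three fields are set (the schema itself does not mark them as required).

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the CRD schema above.
var (
	versionRe  = regexp.MustCompile(`^\d+\.\d+\.\d+$`)
	scheduleRe = regexp.MustCompile(`^[0-9]{2}:[0-9]{2}$`)
)

// validateSpec is a hypothetical helper that mirrors the schema's checks.
func validateSpec(replicas int, version, backupSchedule string) error {
	if replicas < 1 {
		return fmt.Errorf("replicas must be >= 1, got %d", replicas)
	}
	if !versionRe.MatchString(version) {
		return fmt.Errorf("version %q must look like MAJOR.MINOR.PATCH", version)
	}
	if !scheduleRe.MatchString(backupSchedule) {
		return fmt.Errorf("backupSchedule %q must look like HH:MM", backupSchedule)
	}
	return nil
}

func main() {
	fmt.Println(validateSpec(3, "15.4.0", "02:30")) // accepted
	fmt.Println(validateSpec(3, "15.4", "02:30"))   // rejected: two-part version
}
```

Note that the schedule pattern only checks the shape: a value such as 99:99 still passes, so range checks would have to live in a webhook or in the controller.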

The Controller Loop: Reconciling Desired vs. Actual State

The heart of every operator is the reconcile function. Using controller‑runtime v0.14.0, a skeleton looks like this:

// controller/postgrescluster_controller.go
package controller

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    dbv1alpha1 "github.com/nileshblog.tech/postgres-operator/api/v1alpha1"
)

// resyncPeriod is an illustrative interval for periodic re-checks; tune it for your workload.
const resyncPeriod = 5 * time.Minute

// PostgresClusterReconciler reconciles a PostgresCluster object
type PostgresClusterReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// Reconcile implements the main loop
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx).WithValues("postgrescluster", req.NamespacedName)

    // 1. Fetch the CR
    var pgc dbv1alpha1.PostgresCluster
    if err := r.Get(ctx, req.NamespacedName, &pgc); err != nil {
        if errors.IsNotFound(err) {
            // The object was deleted after the request was queued; nothing to do.
            logger.Info("resource not found, ignoring")
            return ctrl.Result{}, nil
        }
        logger.Error(err, "failed to get PostgresCluster")
        return ctrl.Result{}, err
    }

    // 2. Ensure StatefulSet exists with desired replica count
    // (omitted for brevity – imagine createOrUpdateStatefulSet with proper error handling)

    // 3. Verify backup schedule and create CronJob if needed
    // (omitted – ensure idempotent creation)

    logger.Info("reconciliation complete")
    // Requeue periodically so drift is caught even without a watch event.
    return ctrl.Result{RequeueAfter: resyncPeriod}, nil
}

// SetupWithManager registers the controller
func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&dbv1alpha1.PostgresCluster{}).
        Complete(r)
}

Notice the idempotent pattern: each step fetches the current object, checks if it already matches the spec, and creates or updates only when necessary. This design prevents endless thrashing.

Operator SDK vs. Kubebuilder: Choosing Your Framework

Aspect	Operator SDK v1.28	Kubebuilder v3.8
Primary library	`controller-runtime` + Helm/Ansible plugins	Pure `controller-runtime` (no Helm/Ansible integration)
Ease of scaffolding	Scaffolds Go, Helm‑based, or Ansible‑based operators from one CLI	Scaffolds Go projects only; Helm support is manual
Community support	Part of the Operator Framework with strong Red Hat backing, good for hybrid workloads	Kubernetes SIG project (kubernetes-sigs), lightweight, favoured for pure Go
Learning curve	Slightly higher (multiple runtimes)	Steeper at first but more predictable
Preferred when	You need a mix of Helm, Ansible, and Go logic	You want a clean Go‑only operator with minimal dependencies

My own experience shows that for database operators that must talk to external cloud APIs (e.g., provisioning an RDS instance), the Operator SDK’s Helm bridge can be handy for re‑using existing Helm charts. For pure‑Go controllers that interact with etcd directly, Kubebuilder feels snappier.

Design Patterns for Robust Stateful Operators

Building a production‑grade operator is more than stitching together a CRD and a reconcile loop. Certain patterns emerge as non‑negotiable.

Idempotency and Self‑Healing: The Core of Reliable Reconciliation

An idempotent controller guarantees that running the same logic twice yields the same result. Achieve this by:

Checksum‑based diffing: Store a hash of the spec in the CR’s status and compare it with the actual resource.
Separate “desired” and “observed” fields: Keep status.conditions up to date so you can tell whether a resource needs fixing.
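As a rough illustration of checksum‑based diffing, the sketch below hashes a spec and compares it with a previously recorded hash. The `PostgresClusterSpec` struct here is a trimmed, hypothetical copy of the API type, and `specHash` and `needsReconcile` are names invented for this example; in a real controller the observed hash would live in the CR's status or in an annotation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PostgresClusterSpec is a trimmed, hypothetical copy of the CR spec.
type PostgresClusterSpec struct {
	Replicas       int    `json:"replicas"`
	Version        string `json:"version"`
	BackupSchedule string `json:"backupSchedule,omitempty"`
}

// specHash returns a hex SHA-256 of the JSON-encoded spec. Struct fields are
// marshalled in declaration order, so equal specs produce equal hashes.
func specHash(spec PostgresClusterSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", fmt.Errorf("encoding spec: %w", err)
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsReconcile reports whether the spec differs from the one whose hash was
// recorded the last time reconciliation succeeded.
func needsReconcile(spec PostgresClusterSpec, observedHash string) (bool, error) {
	current, err := specHash(spec)
	if err != nil {
		return false, err
	}
	return current != observedHash, nil
}

func main() {
	spec := PostgresClusterSpec{Replicas: 3, Version: "15.4.0", BackupSchedule: "02:30"}
	observed, _ := specHash(spec)

	changed, _ := needsReconcile(spec, observed)
	fmt.Println("unchanged spec needs work:", changed)

	spec.Replicas = 5
	changed, _ = needsReconcile(spec, observed)
	fmt.Println("edited spec needs work:", changed)
}
```

A hash only tells you the spec changed since the last successful pass; it does not detect drift in the child resources themselves, which is why the second bullet (observed state in status.conditions) still matters.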

Consider a backup job that may have been partially created:

func (r *PostgresClusterReconciler) ensureBackupCron(ctx context.Context, pgc *dbv1alpha1.PostgresCluster) error {
    desired := constructCronJob(pgc) // builds the spec
    existing := &batchv1.CronJob{}
    err := r.Get(ctx, client.ObjectKey{Name: desired.Name, Namespace: pgc.Namespace}, existing)
    if err != nil && !errors.IsNotFound(err) {
        return fmt.Errorf("fetching existing CronJob: %w", err)
    }

    // If not found, create it
    if errors.IsNotFound(err) {
        if err := r.Create(ctx, desired); err != nil {
            return fmt.Errorf("creating backup CronJob: %w", err)
        }
        return nil
    }

    // If spec differs, update
    if !equality.Semantic.DeepEqual(desired.Spec, existing.Spec) {
        existing.Spec = desired.Spec
        if err := r.Update(ctx, existing); err != nil {
            return fmt.Errorf("updating backup CronJob: %w", err)
        }
    }
    return nil
}

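As long as constructCronJob derives a deterministic name from the CR, repeated runs converge on a single CronJob instead of creating duplicates. Two caveats: comparing against the live object can report a difference on every pass once the API server has filled in defaulted fields, and orphaned jobs are only avoided if constructCronJob also sets an owner reference to the PostgresCluster.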

Handling Ordered Operations and Rollbacks

Stateful upgrades often require a phased rollout. One pattern is to track the rollout phase in an explicit field on the custom resource (e.g., CassandraCluster.Status.Phase) instead of inferring it from pod state. The controller checks the current phase and executes the appropriate step:

Drain old pods (stop routing traffic to them and let each pod shut down within its terminationGracePeriodSeconds).
Upgrade the container image.
Validate readiness using a custom probe.
Promote the new nodes.

If any step fails, the controller can rollback by resetting the phase and re‑applying the previous spec. This approach mirrors the Saga pattern in distributed systems.
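The phased approach above can be sketched as a small state machine. Everything in this example is hypothetical: the `Phase` values, the `advance` and `runRollout` helpers, and the step functions are placeholders for real drain, upgrade, validate, and promote logic, and the rollback here simply moves to a terminal `RolledBack` phase where a real controller would re‑apply the previous spec.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase names one stage of a hypothetical upgrade rollout.
type Phase string

const (
	PhaseDrain      Phase = "Drain"
	PhaseUpgrade    Phase = "Upgrade"
	PhaseValidate   Phase = "Validate"
	PhasePromote    Phase = "Promote"
	PhaseDone       Phase = "Done"
	PhaseRolledBack Phase = "RolledBack"
)

var rolloutOrder = []Phase{PhaseDrain, PhaseUpgrade, PhaseValidate, PhasePromote}

var errNotReady = errors.New("new pods failed readiness validation")

// advance runs the step registered for the current phase (if any) and returns
// the next phase. A failing step moves the rollout to PhaseRolledBack.
func advance(current Phase, steps map[Phase]func() error) (Phase, error) {
	for i, p := range rolloutOrder {
		if p != current {
			continue
		}
		if step, ok := steps[p]; ok {
			if err := step(); err != nil {
				return PhaseRolledBack, fmt.Errorf("phase %s: %w", p, err)
			}
		}
		if i == len(rolloutOrder)-1 {
			return PhaseDone, nil
		}
		return rolloutOrder[i+1], nil
	}
	return current, nil // Done and RolledBack are terminal
}

// runRollout drives the phases until a terminal phase is reached.
func runRollout(steps map[Phase]func() error) (Phase, error) {
	phase := PhaseDrain
	for phase != PhaseDone && phase != PhaseRolledBack {
		next, err := advance(phase, steps)
		if err != nil {
			return next, err
		}
		phase = next
	}
	return phase, nil
}

func main() {
	ok := func() error { return nil }
	steps := map[Phase]func() error{
		PhaseDrain: ok, PhaseUpgrade: ok, PhaseValidate: ok, PhasePromote: ok,
	}
	fmt.Println(runRollout(steps))

	steps[PhaseValidate] = func() error { return errNotReady }
	fmt.Println(runRollout(steps))
}
```

In a controller, each call to `advance` would typically happen in its own reconcile pass, with the returned phase written back to the custom resource so a restart resumes from the right step.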

Integrating with External Systems and Cloud APIs

Many stateful services rely on cloud‑native services: snapshots in AWS EBS, IAM roles, or DNS records in Cloudflare. Embedding those calls directly inside the reconcile loop, however, can block the controller if the external API is slow.

Best practice:

Decouple heavy I/O via a work queue (as controller-runtime does).
Implement exponential back‑off with jitter for flaky APIs.
Persist operation state in the CR’s status so the controller can resume after a crash.
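For the back‑off bullet, here is a minimal standard‑library sketch of exponential back‑off with full jitter. `retryWithBackoff`, `demo`, and `errFlaky` are names made up for this example; in an operator you would more likely lean on the work queue's rate limiter or the helpers shipped with client-go, so treat this as an illustration of the idea rather than a recommended implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errFlaky = errors.New("temporary API failure")

// retryWithBackoff calls op up to attempts times. Between attempts it sleeps a
// random duration in [0, delay), doubling delay each time up to maxDelay.
func retryWithBackoff(ctx context.Context, attempts int, base, maxDelay time.Duration, op func() error) error {
	if attempts < 1 || base <= 0 || maxDelay < base {
		return errors.New("need attempts >= 1 and 0 < base <= maxDelay")
	}
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		// Full jitter spreads retries out so many clients do not retry in lockstep.
		wait := time.Duration(rand.Int63n(int64(delay)))
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// demo simulates an API that fails the first `failures` calls, then succeeds.
func demo(failures int) (calls int, err error) {
	op := func() error {
		calls++
		if calls <= failures {
			return errFlaky
		}
		return nil
	}
	err = retryWithBackoff(context.Background(), 4, time.Millisecond, 8*time.Millisecond, op)
	return calls, err
}

func main() {
	fmt.Println(demo(2))  // succeeds on the third call
	fmt.Println(demo(10)) // exhausts all four attempts
}
```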

Here’s a short example of an AWS snapshot request using the official SDK v2:

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
    ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// takeETCDSnapshot requests an EBS snapshot of volumeID and returns the snapshot ID.
func takeETCDSnapshot(ctx context.Context, volumeID string) (string, error) {
    // The region is hard-coded for brevity; in practice read it from configuration.
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
    if err != nil {
        return "", fmt.Errorf("loading AWS config: %w", err)
    }
    client := ec2.NewFromConfig(cfg)

    out, err := client.CreateSnapshot(ctx, &ec2.CreateSnapshotInput{
        VolumeId: aws.String(volumeID),
        // Tag the snapshot so it can be traced back to the operator.
        TagSpecifications: []ec2types.TagSpecification{
            {
                ResourceType: ec2types.ResourceTypeSnapshot,
                Tags: []ec2types.Tag{
                    {Key: aws.String("operator"), Value: aws.String("etcd-backup-operator")},
                },
            },
        },
    })
    if err != nil {
        return "", fmt.Errorf("creating snapshot: %w", err)
    }
    return aws.ToString(out.SnapshotId), nil
}

The operator can call takeETCDSnapshot inside the reconcile loop, but only after it records the request in status.lastBackupRequest to avoid duplicate snapshots.
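One way to picture the "record first, then act" guard is the sketch below. `BackupStatus`, its fields, and `ensureSnapshot` are hypothetical; in a real operator the status write would be an API call that must succeed before the snapshot request goes out. The snapshot call is passed in as a function so the example runs without AWS credentials.

```go
package main

import "fmt"

// BackupStatus is a hypothetical slice of the CR's status.
type BackupStatus struct {
	LastBackupRequest string // ID of the backup request the operator committed to
	SnapshotID        string // filled in once the snapshot exists
}

// ensureSnapshot takes at most one snapshot per request ID. Calling it again
// with the same ID after success is a no-op.
func ensureSnapshot(st *BackupStatus, requestID string, take func() (string, error)) error {
	if st.LastBackupRequest == requestID && st.SnapshotID != "" {
		return nil // this request was already fulfilled
	}
	if st.LastBackupRequest != requestID {
		// Record the intent before calling out (a status update in a real controller).
		st.LastBackupRequest = requestID
		st.SnapshotID = ""
	}
	id, err := take()
	if err != nil {
		return fmt.Errorf("snapshot for request %s: %w", requestID, err)
	}
	st.SnapshotID = id
	return nil
}

func main() {
	calls := 0
	take := func() (string, error) {
		calls++
		return fmt.Sprintf("snap-%d", calls), nil
	}

	st := &BackupStatus{}
	_ = ensureSnapshot(st, "backup-001", take)
	_ = ensureSnapshot(st, "backup-001", take) // second pass does nothing
	fmt.Println(calls, st.SnapshotID)
}
```

This guard does not cover a crash between the snapshot being created and its ID being stored. Closing that gap usually means tagging the snapshot with the request ID and looking it up before creating a new one.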

Real‑World Implementations and Architectural Trade‑offs

Let’s walk through a concrete scenario that we’ve built on nileshblog.tech: a Cassandra operator that manages a 5‑node ring, supports automated backup to S3, and integrates with a custom monitoring stack.

Case Study: Managing a Distributed Database Cluster (Cassandra)

Architecture Diagram

flowchart TD
    subgraph K8sCluster[Kubernetes Cluster]
        CRD[CRD: CassandraCluster] --> Controller["Controller (Kubebuilder v3.8)"]
        Controller --> StatefulSet["StatefulSet (5 Pods)"]
        Controller --> BackupJob["CronJob (S3 Backup)"]
        Controller --> ConfigMap["ConfigMap (ring-tokens)"]
    end
    External[External Services] -->|IAM credentials| Controller
    External -->|S3 bucket| BackupJob

Alt text: Diagram showing CassandraCluster CRD feeding a Kubebuilder‑based controller, which orchestrates a StatefulSet, Backup CronJob, and ConfigMap, while interacting with external IAM and S3 services.

Key steps performed by the operator

Phase	Action	Idempotent guard
Init	Create a headless Service for gossip	Check Service existence
Scale‑up	Add a new pod, wait for `JOIN` status via JMX	Verify node appears in `nodetool status`
Backup	Trigger `nodetool snapshot`, upload to S3	Store snapshot ID in `status.lastBackup`
Upgrade	Pause traffic, roll pods one‑by‑one, run `nodetool repair`	Ensure each pod reaches `UP/Normal` before proceeding
Drain	Decommission pod via `nodetool decommission`	Confirm token removal from ring map

💡 Pro Tip: Use nodetool output parsing as a deterministic source of truth rather than assuming pod readiness.
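As a sketch of what that parsing might look like, the example below extracts the two‑letter status/state code for each node. It assumes the tabular layout that nodetool status typically prints (a status/state code in the first column, the address in the second); the sample output, `parseNodeStates`, and `ringHealthy` are all made up for illustration, and the exact columns can vary between Cassandra versions.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleStatus imitates the table printed by `nodetool status`; the addresses
// and values are invented for this example.
const sampleStatus = `Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID   Rack
UN  10.0.0.1   1.1 GiB    256     33.4%   aaaa      rack1
UJ  10.0.0.2   210 MiB    256     ?       bbbb      rack1
DN  10.0.0.3   1.0 GiB    256     33.3%   cccc      rack1
`

// parseNodeStates maps each node address to its two-letter code, for example
// "UN" (Up/Normal) or "UJ" (Up/Joining). Header lines are skipped.
func parseNodeStates(out string) map[string]string {
	states := make(map[string]string)
	for _, line := range strings.Split(out, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || len(fields[0]) != 2 {
			continue
		}
		code := fields[0]
		if strings.ContainsRune("UD", rune(code[0])) && strings.ContainsRune("NLJM", rune(code[1])) {
			states[fields[1]] = code
		}
	}
	return states
}

// ringHealthy reports whether exactly want nodes are listed and all are Up/Normal.
func ringHealthy(states map[string]string, want int) bool {
	if len(states) != want {
		return false
	}
	for _, code := range states {
		if code != "UN" {
			return false
		}
	}
	return true
}

func main() {
	states := parseNodeStates(sampleStatus)
	fmt.Println(len(states), states["10.0.0.2"], ringHealthy(states, 3))
}
```

A guard like `ringHealthy` is what lets the Scale‑up and Upgrade phases wait for a joining node to settle before touching the next pod.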

The Observability Gap: Logging, Metrics, and Debugging for Operators

Operators produce their own logs, but you also need metrics about the operator itself (reconcile latency, error rates). The controller-runtime metrics endpoint (/metrics on port 8080) exposes Prometheus‑compatible counters. Add custom collectors:

var (
    reconcileDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "postgres_operator_reconcile_seconds",
        Help:    "Duration of reconcile loops",
        Buckets: prometheus.ExponentialBuckets(0.1, 2, 8),
    }, []string{"resource", "outcome"})
)

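func init() {
    // Register with controller-runtime's registry (sigs.k8s.io/controller-runtime/pkg/metrics)
    // so the histogram is served on the manager's /metrics endpoint.
    metrics.Registry.MustRegister(reconcileDuration)
}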

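Tie the histogram into the reconcile function. The deferred closure reads rerr, so Reconcile must declare a named error result, for example (res ctrl.Result, rerr error):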

start := time.Now()
defer func() {
    outcome := "success"
    if rerr != nil {
        outcome = "error"
    }
    reconcileDuration.WithLabelValues("PostgresCluster", outcome).Observe(time.Since(start).Seconds())
}()
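The deferred‑closure pattern above depends on a named return value, which is easy to miss. The following self‑contained sketch shows the same shape without Prometheus: `record` and `observed` are hypothetical stand‑ins for the histogram, and `reconcileOnce` stands in for Reconcile.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// observation is a hypothetical stand-in for one histogram sample.
type observation struct {
	outcome string
	elapsed time.Duration
}

var (
	observed []observation
	errBoom  = errors.New("boom")
)

// record plays the role of reconcileDuration.WithLabelValues(...).Observe(...).
func record(outcome string, elapsed time.Duration) {
	observed = append(observed, observation{outcome: outcome, elapsed: elapsed})
}

// reconcileOnce uses a named result (rerr) so the deferred closure can see the
// error that is actually being returned.
func reconcileOnce(work func() error) (rerr error) {
	start := time.Now()
	defer func() {
		outcome := "success"
		if rerr != nil {
			outcome = "error"
		}
		record(outcome, time.Since(start))
	}()
	return work()
}

func main() {
	_ = reconcileOnce(func() error { return nil })
	_ = reconcileOnce(func() error { return errBoom })
	for _, o := range observed {
		fmt.Println(o.outcome)
	}
}
```

With an unnamed result the closure would have nothing to inspect, and every pass would be recorded as a success.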

For debugging, raise the log verbosity (the Kubebuilder scaffold binds zap flags such as --zap-log-level=debug and --zap-devel) and forward the operator logs to a sidecar that ships them to a Loki instance. This practice exposed a subtle race condition in our Cassandra operator when two replicas attempted to join simultaneously.

Trade‑off: Operator Complexity vs. Configuration Management Tools (Helm)

Helm excels at templating static manifests. An operator, however, adds runtime intelligence. The trade‑off matrix looks like this:

Concern	Helm	Operator
Simple config (env vars, limits)	✅	✅
Coordinated upgrade with status checks	❌	✅
Automatic backup & restore	❌	✅
Multi‑cluster federation	❌	✅ (via Cluster API)
Learning curve	Low	Medium‑High
Maintenance overhead	Low (chart versioning)	High (controller code, CI)

If your service only needs a handful of tunable parameters, Helm may suffice. When you must react to runtime events—like a node losing its PVC—you’ll quickly run into Helm’s limitations.

⚠️ Warning: Mixing Helm and an operator on the same resource can cause reconciliation loops to fight each other. Adopt a clear ownership model: either the operator adopts Helm‑created resources or Helm stays out entirely.

Operationalizing Your Operator: CI/CD and Lifecycle Management

Writing the code is only half the battle. Getting the operator into customers’ clusters safely demands rigorous pipelines.

Testing Strategies: Unit, Integration, and E2E for Operators

Unit tests – mock the client.Client with controller-runtime’s fake client. Verify that Reconcile calls Create/Update correctly. Example:

```go
func TestReconcileCreatesStatefulSet(t *testing.T) {
    scheme := runtime.NewScheme()
    _ = dbv1alpha1.AddToScheme(scheme)
    _ = appsv1.AddToScheme(scheme)

    fakeClient := fake.NewClientBuilder().WithScheme(scheme).Build()
    reconciler := &PostgresClusterReconciler{Client: fakeClient, Scheme: scheme}

    // Provide a minimal PostgresCluster CR
    pg := &dbv1alpha1.PostgresCluster{
        ObjectMeta: metav1.ObjectMeta{Name: "pg-test", Namespace: "default"},
        // Version follows the MAJOR.MINOR.PATCH pattern required by the CRD schema.
        Spec: dbv1alpha1.PostgresClusterSpec{Replicas: 3, Version: "13.4.0"},
    }
    if err := fakeClient.Create(context.Background(), pg); err != nil {
        t.Fatalf("creating PostgresCluster: %v", err)
    }

    _, err := reconciler.Reconcile(context.Background(), ctrl.Request{NamespacedName: types.NamespacedName{Name: "pg-test", Namespace: "default"}})
    if err != nil {
        t.Fatalf("reconcile failed: %v", err)
    }

    // Assert StatefulSet exists
    ss := &appsv1.StatefulSet{}
    err = fakeClient.Get(context.Background(), client.ObjectKey{Name: "pg-test", Namespace: "default"}, ss)
    if err != nil {
        t.Fatalf("expected StatefulSet, got error: %v", err)
    }
}
```

Integration tests – spin up a real API server using envtest (shipped with controller-runtime). Run the full reconcile cycle against a temporary etcd.
E2E tests – use kubectl‑based scripts (or kuttl) that deploy the operator into a kind cluster, simulate failure scenarios (node loss, network partition), and assert that the custom resource’s status.conditions reflect recovery.

Automate all three layers in GitHub Actions, caching the Go modules and Docker layers for speed.

Versioning CRDs and Managing Schema Evolution

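CRD versions follow the Kubernetes API versioning convention (v1alpha1 → v1beta1 → v1) rather than semantic versioning: additive changes can stay within a version, while a breaking schema change calls for a new version (e.g., v1alpha1 → v1beta1). Strategies to keep upgrades smooth: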

Conversion webhook – implements ConvertTo/ConvertFrom to translate old objects to new schema on the fly.

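No conversion webhook (conversion strategy None) – if you keep changes additive (new optional fields), the API server can serve every version from the same stored object, changing only apiVersion, so you can skip the webhook entirely.
Pruning of unknown fields – with apiextensions.k8s.io/v1 CRDs, fields missing from the schema are pruned by default (spec.preserveUnknownFields must be false), so declare every new field explicitly in the schema.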

On nileshblog.tech we introduced a backupRetention field in v1beta1. The conversion webhook copied the old retentionDays value, ensuring existing clusters kept their backup policies without manual migration.
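The field rename described above could look roughly like this in plain Go. The struct shapes and the `toV1Beta1`/`toV1Alpha1` helpers are assumptions made for illustration (in particular, both fields are modelled as a number of days); a real conversion webhook would implement the conversion interfaces on the generated API types instead.

```go
package main

import "fmt"

// SpecV1Alpha1 is a hypothetical trimmed spec for the old version.
type SpecV1Alpha1 struct {
	Replicas      int
	RetentionDays int
}

// SpecV1Beta1 is a hypothetical trimmed spec for the new version.
// BackupRetention is assumed to carry the same unit (days) as RetentionDays.
type SpecV1Beta1 struct {
	Replicas        int
	BackupRetention int
}

// toV1Beta1 upgrades an old object by copying retentionDays into backupRetention.
func toV1Beta1(old SpecV1Alpha1) SpecV1Beta1 {
	return SpecV1Beta1{
		Replicas:        old.Replicas,
		BackupRetention: old.RetentionDays,
	}
}

// toV1Alpha1 performs the reverse mapping so clients of the old version keep working.
func toV1Alpha1(cur SpecV1Beta1) SpecV1Alpha1 {
	return SpecV1Alpha1{
		Replicas:      cur.Replicas,
		RetentionDays: cur.BackupRetention,
	}
}

func main() {
	old := SpecV1Alpha1{Replicas: 5, RetentionDays: 14}
	upgraded := toV1Beta1(old)
	fmt.Printf("%+v\n", upgraded)
	fmt.Println("round trip lossless:", toV1Alpha1(upgraded) == old)
}
```

Checking that a round trip through both functions returns the original object is a cheap way to catch fields that a conversion silently drops.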

Security Considerations: RBAC and Admission Webhooks

Operators often need cluster‑wide permissions (e.g., creating PVCs across namespaces). Craft a minimal RBAC policy:

# operator-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cassandra-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
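  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["db.example.com"]
    resources: ["cassandraclusters"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["db.example.com"]
    resources: ["cassandraclusters/status"]
    verbs: ["get", "update", "patch"]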
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cassandra-operator-binding
subjects:
  - kind: ServiceAccount
    name: cassandra-operator
    namespace: operators
roleRef:
  kind: ClusterRole
  name: cassandra-operator
  apiGroup: rbac.authorization.k8s.io

Add a validating admission webhook to enforce that a CassandraCluster never reduces replicas below the current quorum. The webhook rejects requests that would break data safety, reducing human error.

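💡 Pro Tip: Register the webhook with failurePolicy: Fail so that requests are rejected when the webhook is unreachable or returns an error, rather than letting unsafe objects slip through. The flip side is that a webhook outage blocks writes to the resource, so run the webhook with more than one replica.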
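The core of such a webhook is a small pure function. The sketch below assumes "quorum" means a strict majority of the current replica count; `quorum` and `validateScaleDown` are hypothetical names, and a real webhook would call something like this from its update handler with the old and new objects.

```go
package main

import "fmt"

// quorum returns the strict majority for n replicas (assumed definition).
func quorum(n int) int {
	return n/2 + 1
}

// validateScaleDown rejects an update that would drop the replica count below
// the quorum of the current cluster size. Scale-ups always pass.
func validateScaleDown(current, requested int) error {
	if requested >= current {
		return nil
	}
	if min := quorum(current); requested < min {
		return fmt.Errorf("replicas %d is below the quorum of %d for a %d-node cluster",
			requested, min, current)
	}
	return nil
}

func main() {
	fmt.Println(validateScaleDown(5, 3)) // allowed: 3 is the quorum of 5
	fmt.Println(validateScaleDown(5, 2)) // rejected
}
```

Keeping the rule in a plain function makes it trivial to unit test and to reuse inside the reconcile loop as a second line of defence.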

Common Errors & Fixes

Symptom	Likely Cause	Fix
Operator constantly requeues with `RequeueAfter: 0`	Reconcile loop does not set a terminal condition; status never reaches desired state.	Add explicit `return ctrl.Result{RequeueAfter: time.Minute}, nil` when waiting for external async work.
Pods created by the operator stay in `Pending`	PVCs remain unbound because the StorageClass is missing, misnamed, or its provisioner cannot satisfy the requested size.	Run `kubectl describe pvc` to see the binding error, then fix the `storageClassName` or adjust `resources.requests.storage`.
Backup CronJob fires but no snapshot appears in S3	IAM role used by the backup job's pod lacks `s3:PutObject`.	Grant the necessary permissions in the associated IAM policy and re-run the backup job.
CRD validation error after upgrading to v1beta1	New required field missing in existing resources.	Use a conversion webhook to set default values for the new field.
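Operator logs spam “rate limit exceeded” from the API server	Reconcile loop performs API calls without back‑off.	Return the error so the work queue's rate limiter applies back‑off, or wrap calls in exponential back‑off (e.g., `wait.ExponentialBackoff` from `k8s.io/apimachinery/pkg/util/wait` or `retry.OnError` from `k8s.io/client-go/util/retry`).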

Frequently Asked Questions

When should I use a custom operator instead of a Helm chart?

Use Helm for simple, mostly stateless services where configuration can be expressed as a set of values.yaml entries. Choose a custom operator when the application needs state‑aware actions—coordinated upgrades, disaster‑recovery steps, or interaction with external systems—that cannot be captured in static manifests or Helm hooks.

What is the biggest challenge in writing a production‑grade operator?

Ensuring the idempotency and reliability of the reconciliation loop under every failure mode—network partitions, partial updates, or conflicting owner references. A non‑idempotent loop can cause endless thrashing, leading to data corruption.

Can operators work with existing Helm charts or deployments?

Yes. An operator can adopt resources created by Helm by adding an owner reference to the underlying objects. The operator then owns the lifecycle while Helm continues to supply the base manifests. Coordination is essential to avoid two controllers fighting over the same fields.

Call to Action

If this deep dive helped you demystify stateful operators, let me know! Drop a comment, share the article on social media, or subscribe to the newsletter at nileshblog.tech for more hands‑on Kubernetes patterns, code samples, and production stories.

Author Bio:
I’m Nilesh Raut, a Software Development Engineer with 2+ years of experience, specializing in Go, JavaScript, Python, Docker, Kubernetes, Git, Jenkins, microservices, and system design (LLD/HLD), backed by a strong foundation in data structures and algorithms. Alongside my engineering journey, I bring 4+ years of hands‑on experience in SEO, where I’ve worked extensively on content strategy, keyword research, technical SEO, and organic growth, helping products and businesses scale efficiently by aligning solid technology with search‑driven performance.
