TL;DR – Quick Takeaways
- StatefulSets can spin up pods, but they stumble when a workload needs coordinated upgrades, backups, or external‑system integration.
- Operators embed domain knowledge directly into the Kubernetes control plane, turning complex lifecycle steps into declarative API calls.
- CRDs define the desired state; a reconciliation loop continuously drives the cluster toward that state, handling failures idempotently.
- Kubebuilder v3.8 and Operator SDK v1.28 both scaffold production‑ready code, but Kubebuilder leans heavily on controller‑runtime while the SDK bundles Helm/Ansible helpers.
- Real‑world operators (e.g., etcd‑backup‑operator) showcase patterns for backup/restore, version upgrades, and safe rollout—capabilities you rarely see in “hello‑world” tutorials.
Before you start, you need:
- A working Kubernetes cluster (v1.26+ recommended).
- kubectl 1.26+, Go 1.21, and Docker 24.
- Familiarity with Custom Resource Definitions and basic Go programming.
- Access to a container registry (Docker Hub, GitHub Packages, or a private repo).
The Limitations of Native Kubernetes Controllers for Stateful Workloads
When you launch a production database on Kubernetes, the first thing you reach for is usually a StatefulSet. It gives you stable network IDs, ordered pod creation, and a PVC per replica. That sounds perfect—until you need to upgrade the cluster without data loss, perform a coordinated backup, or scale down while preserving quorum.
Where StatefulSets Fall Short: Complex Lifecycles and External Dependencies
StatefulSets handle pod ordering but they lack any notion of application‑level state. Imagine a three‑node Cassandra ring. Scaling from three to five nodes requires:
- Adding two new pods in a specific order.
- Streaming data from existing nodes to the newcomers.
- Rebalancing token ranges without causing over‑replication.
A plain StatefulSet will create the pods, but it won't kick off the streaming or adjust the ring token map. You end up writing ad-hoc scripts, tying them to postStart hooks, and hoping they survive node restarts. When a network partition occurs, those scripts can leave the ring in an inconsistent state, and Kubernetes cannot detect it, because neither the scheduler nor the StatefulSet controller knows anything about the ring.
⚠️ Warning: Relying on lifecycle hooks for stateful coordination often leads to race conditions and hard‑to‑debug state drift.
The Operator Pattern: Extending the Kubernetes API for Your Domain
Operators turn the “external script” problem into a first‑class citizen of the API. By defining a Custom Resource such as CassandraCluster, you hand the user a single YAML object that describes what the desired cluster looks like. The operator’s controller reads that spec, translates it into a series of safe, idempotent actions, and updates the cluster until reality matches the spec.
💡 Pro Tip: Think of an operator as a state machine that lives inside the cluster, not a batch job running on a CI server.
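To make the state-machine idea concrete, here is a minimal sketch in plain Go with no Kubernetes dependencies. The names `ClusterSpec`, `ClusterState`, `nextAction`, and `apply` are hypothetical and exist only for illustration; a real controller would read both sides from the API server instead of from local structs.

```go
package main

import "fmt"

// ClusterSpec is a hypothetical desired state, as a user would declare it.
type ClusterSpec struct {
	Replicas int
	Version  string
}

// ClusterState is a hypothetical observed state of the running cluster.
type ClusterState struct {
	Replicas int
	Version  string
}

// nextAction compares desired and observed state and returns the single
// next step to take. An empty string means the two already match.
func nextAction(spec ClusterSpec, state ClusterState) string {
	switch {
	case state.Version != spec.Version:
		return "upgrade"
	case state.Replicas < spec.Replicas:
		return "scale-up"
	case state.Replicas > spec.Replicas:
		return "scale-down"
	default:
		return ""
	}
}

// apply mutates the observed state the way the chosen action would.
func apply(action string, spec ClusterSpec, state *ClusterState) {
	switch action {
	case "upgrade":
		state.Version = spec.Version
	case "scale-up":
		state.Replicas++
	case "scale-down":
		state.Replicas--
	}
}

func main() {
	spec := ClusterSpec{Replicas: 5, Version: "4.1.0"}
	state := ClusterState{Replicas: 3, Version: "4.0.0"}

	// Each pass takes exactly one step, then re-evaluates from scratch.
	for {
		action := nextAction(spec, state)
		if action == "" {
			break
		}
		apply(action, spec, &state)
		fmt.Printf("%s -> %+v\n", action, state)
	}
}
```

Because every pass recomputes the next step from the current state, the loop can be interrupted at any point and resumed without remembering what it did before.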
Anatomy of a Custom Operator: Breaking Down the Components
Building an operator feels like assembling LEGO bricks. Each brick—CRD, controller, scaffold—has a precise role. Let’s dissect them one by one.
Custom Resource Definitions (CRDs): Modeling Your Application State
A CRD tells the API server what fields you expect. For a database, you might expose replicas, backupSchedule, and version. Here’s a minimal snippet for a PostgresCluster CRD using kubebuilder v3.8:
# postgrescluster_crd.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.example.com
spec:
  group: db.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1
                version:
                  type: string
                  pattern: "^\\d+\\.\\d+\\.\\d+$"
                backupSchedule:
                  type: string
                  pattern: "^[0-9]{2}:[0-9]{2}$"
  scope: Namespaced
  names:
    plural: postgresclusters
    singular: postgrescluster
    kind: PostgresCluster
    shortNames:
      - pgc
⚠️ Warning: Exactly one version must be marked storage: true. If you forget to set it on the version you intend to persist, the API server rejects the CRD with a validation error.
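It can help to check sample values against the two `pattern` constraints before applying the CRD. The sketch below copies both patterns from the schema above into Go's `regexp` package, which only approximates how the API server evaluates them; the helper names `validVersion` and `validSchedule` are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are copied from the CRD schema (with YAML escaping removed).
var (
	versionPattern  = regexp.MustCompile(`^\d+\.\d+\.\d+$`)
	schedulePattern = regexp.MustCompile(`^[0-9]{2}:[0-9]{2}$`)
)

// validVersion reports whether v has the three-part form the schema requires.
func validVersion(v string) bool { return versionPattern.MatchString(v) }

// validSchedule reports whether s has the two-digit HH:MM shape.
// The pattern checks shape only, not that the time is a real one.
func validSchedule(s string) bool { return schedulePattern.MatchString(s) }

func main() {
	for _, v := range []string{"15.4.0", "13.4", "latest"} {
		fmt.Printf("version %q valid: %v\n", v, validVersion(v))
	}
	for _, s := range []string{"02:30", "2:30", "99:99"} {
		fmt.Printf("schedule %q valid: %v\n", s, validSchedule(s))
	}
}
```

Two things stand out: a two-part version such as "13.4" is rejected, and "99:99" is accepted because the schedule pattern constrains digits, not ranges. If that matters, tighten the pattern or validate in a webhook.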
The Controller Loop: Reconciling Desired vs. Actual State
The heart of every operator is the reconcile function. Using controller‑runtime v0.14.0, a skeleton looks like this:
// controller/postgrescluster_controller.go
package controller

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	dbv1alpha1 "github.com/nileshblog.tech/postgres-operator/api/v1alpha1"
)

// PostgresClusterReconciler reconciles a PostgresCluster object
type PostgresClusterReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

// Reconcile implements the main loop
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx).WithValues("postgrescluster", req.NamespacedName)

	// 1. Fetch the CR
	var pgc dbv1alpha1.PostgresCluster
	if err := r.Get(ctx, req.NamespacedName, &pgc); err != nil {
		if errors.IsNotFound(err) {
			// The CR was deleted after the request was queued; nothing to do.
			logger.Info("resource not found, ignoring")
			return ctrl.Result{}, nil
		}
		logger.Error(err, "failed to get PostgresCluster")
		return ctrl.Result{}, err
	}

	// 2. Ensure StatefulSet exists with desired replica count
	// (omitted for brevity – imagine createOrUpdateStatefulSet with proper error handling)

	// 3. Verify backup schedule and create CronJob if needed
	// (omitted – ensure idempotent creation)

	logger.Info("reconciliation complete")
	// Requeue periodically to catch drift; the five-minute interval is illustrative.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}

// SetupWithManager registers the controller
func (r *PostgresClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dbv1alpha1.PostgresCluster{}).
		Complete(r)
}
Notice the idempotent pattern: each step fetches the current object, checks if it already matches the spec, and creates or updates only when necessary. This design prevents endless thrashing.
Operator SDK vs. Kubebuilder: Choosing Your Framework
| Aspect | Operator SDK v1.28 | Kubebuilder v3.8 |
|---|---|---|
| Primary library | controller-runtime + Helm/Ansible plugins | Pure controller-runtime (no Helm/Ansible integration) |
| Ease of scaffolding | Generates Helm chart alongside Go code | Generates only Go code; Helm support is manual |
| Community support | Originated at Red Hat, part of the CNCF Operator Framework, good for hybrid workloads | Kubernetes SIG project (kubernetes-sigs), lightweight, favoured for pure Go |
| Learning curve | Slightly higher (multiple runtimes) | Steeper at first but more predictable |
| Preferred when | You need a mix of Helm, Ansible, and Go logic | You want a clean Go‑only operator with minimal dependencies |
My own experience shows that for database operators that must talk to external cloud APIs (e.g., provisioning an RDS instance), the Operator SDK’s Helm bridge can be handy for re‑using existing Helm charts. For pure‑Go controllers that interact with etcd directly, Kubebuilder feels snappier.
Design Patterns for Robust Stateful Operators
Building a production‑grade operator is more than stitching together a CRD and a reconcile loop. Certain patterns emerge as non‑negotiable.
Idempotency and Self‑Healing: The Core of Reliable Reconciliation
An idempotent controller guarantees that running the same logic twice yields the same result. Achieve this by:
- Checksum-based diffing: Store a hash of the spec in the CR's status and compare it with the actual resource.
- Separate "desired" and "observed" fields: Keep status.conditions up to date so you can tell whether a resource needs fixing.
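As a rough illustration of checksum-based diffing, the sketch below hashes a spec and compares it with a previously stored hash. The `PostgresClusterSpec` struct here is a simplified stand-in for the generated API type, and `specHash` and `needsReconcile` are hypothetical helpers, not part of controller-runtime.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PostgresClusterSpec is a simplified stand-in for the generated API type.
type PostgresClusterSpec struct {
	Replicas       int    `json:"replicas"`
	Version        string `json:"version"`
	BackupSchedule string `json:"backupSchedule,omitempty"`
}

// specHash returns a hex-encoded SHA-256 of the JSON form of the spec.
// Struct fields marshal in declaration order, so the result is stable.
func specHash(spec PostgresClusterSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", fmt.Errorf("marshalling spec: %w", err)
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsReconcile reports whether the spec differs from the one whose hash
// was last recorded (for example in a status field).
func needsReconcile(spec PostgresClusterSpec, observedHash string) (bool, error) {
	current, err := specHash(spec)
	if err != nil {
		return false, err
	}
	return current != observedHash, nil
}

func main() {
	spec := PostgresClusterSpec{Replicas: 3, Version: "15.4.0"}
	recorded, _ := specHash(spec)

	changed, _ := needsReconcile(spec, recorded)
	fmt.Println("unchanged spec needs work:", changed)

	spec.Replicas = 5
	changed, _ = needsReconcile(spec, recorded)
	fmt.Println("scaled spec needs work:", changed)
}
```

A hash only tells you that the spec changed since the last successful pass; it says nothing about drift in the child resources, so it complements rather than replaces reading their actual state.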
Consider a backup job that may have been partially created:
// Besides the imports shown earlier, this needs "fmt",
// batchv1 "k8s.io/api/batch/v1" and "k8s.io/apimachinery/pkg/api/equality".
func (r *PostgresClusterReconciler) ensureBackupCron(ctx context.Context, pgc *dbv1alpha1.PostgresCluster) error {
	desired := constructCronJob(pgc) // builds the spec
	existing := &batchv1.CronJob{}
	err := r.Get(ctx, client.ObjectKey{Name: desired.Name, Namespace: pgc.Namespace}, existing)
	if err != nil && !errors.IsNotFound(err) {
		return fmt.Errorf("fetching existing CronJob: %w", err)
	}

	// If not found, create it
	if errors.IsNotFound(err) {
		if err := r.Create(ctx, desired); err != nil {
			return fmt.Errorf("creating backup CronJob: %w", err)
		}
		return nil
	}

	// If spec differs, update
	if !equality.Semantic.DeepEqual(desired.Spec, existing.Spec) {
		existing.Spec = desired.Spec
		if err := r.Update(ctx, existing); err != nil {
			return fmt.Errorf("updating backup CronJob: %w", err)
		}
	}
	return nil
}
The function always ends with a clean state—no duplicated CronJobs, no orphaned jobs.
Handling Ordered Operations and Rollbacks
Stateful upgrades often require a phased rollout. One pattern is to record the current phase in an explicit field on the custom resource (e.g., CassandraCluster.Spec.Phase) rather than inferring it from pod state. The controller checks the current phase and executes the appropriate step:
- Drain old pods (set pod.spec.terminationGracePeriodSeconds).
- Upgrade the container image.
- Validate readiness using a custom probe.
- Promote the new nodes.
If any step fails, the controller can rollback by resetting the phase and re‑applying the previous spec. This approach mirrors the Saga pattern in distributed systems.
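The phased rollout with rollback can be sketched as a small state machine. Everything here is hypothetical: the `Rollout` struct, the phase names, and the `demoSteps` table stand in for real drain, upgrade, and probe logic, and the rollback simply restores the previous image and restarts from the first phase.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase names one step of the rollout.
type Phase string

const (
	PhaseDrain    Phase = "Drain"
	PhaseUpgrade  Phase = "Upgrade"
	PhaseValidate Phase = "Validate"
	PhasePromote  Phase = "Promote"
	PhaseDone     Phase = "Done"
)

var order = []Phase{PhaseDrain, PhaseUpgrade, PhaseValidate, PhasePromote, PhaseDone}

// Rollout is a hypothetical record of where an upgrade stands.
type Rollout struct {
	Phase         Phase
	Image         string
	PreviousImage string
}

type stepFunc func(r *Rollout) error

// next returns the phase that follows p, or PhaseDone at the end.
func next(p Phase) Phase {
	for i, q := range order {
		if q == p && i+1 < len(order) {
			return order[i+1]
		}
	}
	return PhaseDone
}

// advance runs the step for the current phase. On success it moves to the
// next phase; on failure it restores the previous image and resets the
// phase so the old spec is rolled out again.
func advance(r *Rollout, steps map[Phase]stepFunc) error {
	if r.Phase == PhaseDone {
		return nil
	}
	step, ok := steps[r.Phase]
	if !ok {
		return fmt.Errorf("no step registered for phase %q", r.Phase)
	}
	if err := step(r); err != nil {
		failed := r.Phase
		r.Image = r.PreviousImage
		r.Phase = PhaseDrain
		return fmt.Errorf("phase %s failed, rolling back: %w", failed, err)
	}
	r.Phase = next(r.Phase)
	return nil
}

// demoSteps simulates a rollout whose validation fails for one image tag.
func demoSteps() map[Phase]stepFunc {
	ok := func(*Rollout) error { return nil }
	return map[Phase]stepFunc{
		PhaseDrain:   ok,
		PhaseUpgrade: ok,
		PhaseValidate: func(r *Rollout) error {
			if r.Image == "cassandra:broken" {
				return errors.New("readiness probe failed")
			}
			return nil
		},
		PhasePromote: ok,
	}
}

func main() {
	r := &Rollout{Phase: PhaseDrain, Image: "cassandra:broken", PreviousImage: "cassandra:4.0"}
	steps := demoSteps()
	for r.Phase != PhaseDone {
		if err := advance(r, steps); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println("finished on image", r.Image)
}
```

In a real operator each call to `advance` would correspond to one reconcile pass, with the phase written back to the resource before returning.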
Integrating with External Systems and Cloud APIs
Many stateful services rely on cloud‑native services: snapshots in AWS EBS, IAM roles, or DNS records in Cloudflare. Embedding those calls directly inside the reconcile loop, however, can block the controller if the external API is slow.
Best practice:
- Decouple heavy I/O via a work queue (as controller-runtime does).
- Implement exponential back-off with jitter for flaky APIs.
- Persist operation state in the CR's status so the controller can resume after a crash.
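For the back-off item, one common variant is "full jitter": pick a random delay between zero and an exponentially growing ceiling. The sketch below uses only the standard library; `backoff`, `ceiling`, `newRNG`, and the demo constants are names invented for this example, and the base and maximum delays are arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values; tune them for the API you are calling.
const (
	demoBase = 100 * time.Millisecond
	demoMax  = 5 * time.Second
)

// ceiling returns min(maxDelay, base * 2^attempt) without overflowing.
func ceiling(attempt int, base, maxDelay time.Duration) time.Duration {
	c := base
	for i := 0; i < attempt && c < maxDelay; i++ {
		c *= 2
	}
	if c > maxDelay {
		c = maxDelay
	}
	return c
}

// backoff returns a random delay in [0, ceiling) for the given 0-based attempt.
func backoff(attempt int, base, maxDelay time.Duration, rng *rand.Rand) time.Duration {
	c := ceiling(attempt, base, maxDelay)
	if c <= 0 {
		return 0
	}
	return time.Duration(rng.Int63n(int64(c)))
}

// newRNG builds a seeded generator so runs can be reproduced in tests.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	rng := newRNG(42)
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: ceiling %v, chosen delay %v\n",
			attempt,
			ceiling(attempt, demoBase, demoMax),
			backoff(attempt, demoBase, demoMax, rng))
	}
}
```

Inside a reconciler you would usually not sleep for this duration but return it as `RequeueAfter`, so the worker is free to process other objects in the meantime.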
Here’s a short example of an AWS snapshot request using the official SDK v2:
import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// takeETCDSnapshot starts an EBS snapshot of the given volume and returns its ID.
func takeETCDSnapshot(ctx context.Context, volumeID string) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		return "", fmt.Errorf("loading AWS config: %w", err)
	}
	client := ec2.NewFromConfig(cfg)
	out, err := client.CreateSnapshot(ctx, &ec2.CreateSnapshotInput{
		VolumeId: aws.String(volumeID),
		// Tag the snapshot so it can be traced back to the operator.
		TagSpecifications: []ec2types.TagSpecification{
			{
				ResourceType: ec2types.ResourceTypeSnapshot,
				Tags: []ec2types.Tag{
					{Key: aws.String("operator"), Value: aws.String("etcd-backup-operator")},
				},
			},
		},
	})
	if err != nil {
		return "", fmt.Errorf("creating snapshot: %w", err)
	}
	// aws.ToString guards against a nil SnapshotId pointer.
	return aws.ToString(out.SnapshotId), nil
}
The operator can call takeETCDSnapshot inside the reconcile loop, but only after it records the request in status.lastBackupRequest to avoid duplicate snapshots.
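The record-then-call guard might look like the sketch below. It swaps the real AWS call for a fake so it runs without credentials; `BackupStatus`, `ensureSnapshot`, and `fakeSnapshotter` are hypothetical names, and a real controller would persist the status to the API server between the two steps.

```go
package main

import "fmt"

// BackupStatus mirrors hypothetical status fields on the custom resource.
type BackupStatus struct {
	LastBackupRequest string // request ID recorded before the cloud call
	LastSnapshotID    string // filled in once the call has returned
}

// fakeSnapshotter stands in for the real snapshot call and counts invocations.
type fakeSnapshotter struct{ calls int }

func (f *fakeSnapshotter) take(volumeID string) (string, error) {
	f.calls++
	return fmt.Sprintf("snap-%s-%d", volumeID, f.calls), nil
}

// ensureSnapshot triggers a snapshot for requestID at most once per
// completed request. A repeated reconcile with the same ID is a no-op.
func ensureSnapshot(st *BackupStatus, requestID, volumeID string, take func(string) (string, error)) error {
	if st.LastBackupRequest == requestID && st.LastSnapshotID != "" {
		return nil // this request already produced a snapshot
	}
	// Record the intent first so a later pass can see what was attempted.
	st.LastBackupRequest = requestID
	st.LastSnapshotID = ""

	id, err := take(volumeID)
	if err != nil {
		return fmt.Errorf("taking snapshot for request %s: %w", requestID, err)
	}
	st.LastSnapshotID = id
	return nil
}

func main() {
	st := &BackupStatus{}
	snap := &fakeSnapshotter{}

	// Two reconcile passes for the same request, then a new request.
	for _, req := range []string{"2024-01-01T02:30", "2024-01-01T02:30", "2024-01-02T02:30"} {
		if err := ensureSnapshot(st, req, "vol-123", snap.take); err != nil {
			fmt.Println("error:", err)
		}
		fmt.Printf("request %s -> snapshot %s (calls so far: %d)\n", req, st.LastSnapshotID, snap.calls)
	}
}
```

This guard does not cover a crash between recording the request and receiving the snapshot ID. Closing that gap needs a lookup on the cloud side, for example by tagging the snapshot with the request ID and searching for it before retrying.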
Real‑World Implementations and Architectural Trade‑offs
Let’s walk through a concrete scenario that we’ve built on nileshblog.tech: a Cassandra operator that manages a 5‑node ring, supports automated backup to S3, and integrates with a custom monitoring stack.
Case Study: Managing a Distributed Database Cluster (Cassandra)
Architecture Diagram
flowchart TD
    subgraph K8sCluster[Kubernetes Cluster]
        CRD["CRD: CassandraCluster"] --> Controller["Controller (Kubebuilder v3.8)"]
        Controller --> StatefulSet["StatefulSet (5 Pods)"]
        Controller --> BackupJob["CronJob (S3 Backup)"]
        Controller --> ConfigMap["ConfigMap (ring-tokens)"]
    end
    External[External Services] -->|IAM credentials| Controller
    External -->|S3 bucket| BackupJob
Alt text: Diagram showing CassandraCluster CRD feeding a Kubebuilder‑based controller, which orchestrates a StatefulSet, Backup CronJob, and ConfigMap, while interacting with external IAM and S3 services.
Key steps performed by the operator
| Phase | Action | Idempotent guard |
|---|---|---|
| Init | Create a headless Service for gossip | Check Service existence |
| Scale‑up | Add a new pod, wait for JOIN status via JMX | Verify node appears in nodetool status |
| Backup | Trigger nodetool snapshot, upload to S3 | Store snapshot ID in status.lastBackup |
| Upgrade | Pause traffic, roll pods one‑by‑one, run nodetool repair | Ensure each pod reaches UP/Normal before proceeding |
| Drain | Decommission pod via nodetool decommission | Confirm token removal from ring map |
💡 Pro Tip: Use nodetool output parsing as a deterministic source of truth rather than assuming pod readiness.
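A minimal parser for that idea is sketched below. The sample text imitates the general layout of `nodetool status` (a two-letter status/state code followed by the address), but the addresses, loads, and host IDs are invented, and the exact columns vary between Cassandra versions, so treat `parseNodeStates` and `allNormal` as illustrative helpers only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleStatus is made-up output in the general shape of `nodetool status`.
const sampleStatus = `Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns   Host ID                               Rack
UN  10.0.0.11  105.2 KiB   16      66.7%  11111111-1111-1111-1111-111111111111  rack1
UJ  10.0.0.12  88.1 KiB    16      ?      22222222-2222-2222-2222-222222222222  rack1
DN  10.0.0.13  101.9 KiB   16      66.6%  33333333-3333-3333-3333-333333333333  rack1
`

// isStateCode reports whether s is a status letter (U or D) followed by a
// state letter (N, L, J or M), such as "UN" or "UJ".
func isStateCode(s string) bool {
	if len(s) != 2 {
		return false
	}
	return strings.ContainsRune("UD", rune(s[0])) && strings.ContainsRune("NLJM", rune(s[1]))
}

// parseNodeStates maps each node address to its two-letter code and skips
// header lines, separators, and anything else that is not a node row.
func parseNodeStates(output string) map[string]string {
	states := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || !isStateCode(fields[0]) {
			continue
		}
		states[fields[1]] = fields[0]
	}
	return states
}

// allNormal reports whether at least one node exists and all are Up/Normal.
func allNormal(states map[string]string) bool {
	if len(states) == 0 {
		return false
	}
	for _, code := range states {
		if code != "UN" {
			return false
		}
	}
	return true
}

func main() {
	states := parseNodeStates(sampleStatus)
	for addr, code := range states {
		fmt.Printf("%s is %s\n", addr, code)
	}
	fmt.Println("safe to proceed:", allNormal(states))
}
```

Gating each rollout step on `allNormal` is stricter than a readiness probe, because a pod can be Ready while its Cassandra node is still joining the ring.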
The Observability Gap: Logging, Metrics, and Debugging for Operators
Operators produce their own logs, but you also need metrics about the operator itself (reconcile latency, error rates). The controller-runtime metrics endpoint (/metrics on port 8080) exposes Prometheus‑compatible counters. Add custom collectors:
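// Imports assumed: "github.com/prometheus/client_golang/prometheus"
// and "sigs.k8s.io/controller-runtime/pkg/metrics".
var (
	reconcileDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "postgres_operator_reconcile_seconds",
		Help:    "Duration of reconcile loops",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 8),
	}, []string{"resource", "outcome"})
)

func init() {
	// Register with controller-runtime's registry so the histogram is served
	// on the manager's /metrics endpoint, not on the default Prometheus registry.
	metrics.Registry.MustRegister(reconcileDuration)
}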
Tie the histogram into the reconcile function:
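// Assumes Reconcile declares a named error result, for example:
// func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (res ctrl.Result, rerr error)
start := time.Now()
defer func() {
	outcome := "success"
	if rerr != nil {
		outcome = "error"
	}
	reconcileDuration.WithLabelValues("PostgresCluster", outcome).Observe(time.Since(start).Seconds())
}()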
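The deferred closure only sees the final error if the function uses a named return value, which is easy to overlook. The standalone sketch below shows the mechanism with a plain slice instead of a Prometheus histogram; `reconcileOnce` and `outcomes` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// outcomes collects one label per call, standing in for a metrics observation.
var outcomes []string

// reconcileOnce uses a named result (rerr) so the deferred function can
// inspect the error that is actually being returned.
func reconcileOnce(fail bool) (rerr error) {
	defer func() {
		outcome := "success"
		if rerr != nil {
			outcome = "error"
		}
		outcomes = append(outcomes, outcome)
	}()

	if fail {
		return errors.New("simulated failure")
	}
	return nil
}

func main() {
	_ = reconcileOnce(false)
	_ = reconcileOnce(true)
	fmt.Println(outcomes)
}
```

With an unnamed result there would be no `rerr` variable for the closure to read, and the code would not compile.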
For debugging, raise the log verbosity (for example with the --zap-log-level=debug flag that the Kubebuilder scaffold binds through controller-runtime's zap options) and forward the operator logs to a sidecar that ships them to a Loki instance. This practice exposed a subtle race condition in our Cassandra operator when two replicas attempted to join simultaneously.
Trade‑off: Operator Complexity vs. Configuration Management Tools (Helm)
Helm excels at templating static manifests. An operator, however, adds runtime intelligence. The trade‑off matrix looks like this:
| Concern | Helm | Operator |
|---|---|---|
| Simple config (env vars, limits) | ✅ | ✅ |
| Coordinated upgrade with status checks | ❌ | ✅ |
| Automatic backup & restore | ❌ | ✅ |
| Multi‑cluster federation | ❌ | ✅ (via Cluster API) |
| Learning curve | Low | Medium‑High |
| Maintenance overhead | Low (chart versioning) | High (controller code, CI) |
If your service only needs a handful of tunable parameters, Helm may suffice. When you must react to runtime events—like a node losing its PVC—you’ll quickly run into Helm’s limitations.
⚠️ Warning: Mixing Helm and an operator on the same resource can cause reconciliation loops to fight each other. Adopt a clear ownership model: either the operator adopts Helm‑created resources or Helm stays out entirely.
Operationalizing Your Operator: CI/CD and Lifecycle Management
Writing the code is only half the battle. Getting the operator into customers’ clusters safely demands rigorous pipelines.
Testing Strategies: Unit, Integration, and E2E for Operators
- Unit tests – mock the client.Client with controller-runtime's fake client. Verify that Reconcile calls Create/Update correctly. Example:

```go
// Imports assumed: context, testing, appsv1 "k8s.io/api/apps/v1",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", k8s.io/apimachinery/pkg/runtime,
// k8s.io/apimachinery/pkg/types, ctrl, client, and
// sigs.k8s.io/controller-runtime/pkg/client/fake.
func TestReconcileCreatesStatefulSet(t *testing.T) {
	scheme := runtime.NewScheme()
	_ = dbv1alpha1.AddToScheme(scheme)
	_ = appsv1.AddToScheme(scheme)

	fakeClient := fake.NewClientBuilder().WithScheme(scheme).Build()
	reconciler := &PostgresClusterReconciler{Client: fakeClient, Scheme: scheme}

	// Provide a minimal PostgresCluster CR
	pg := &dbv1alpha1.PostgresCluster{
		ObjectMeta: metav1.ObjectMeta{Name: "pg-test", Namespace: "default"},
		Spec:       dbv1alpha1.PostgresClusterSpec{Replicas: 3, Version: "13.4"},
	}
	if err := fakeClient.Create(context.Background(), pg); err != nil {
		t.Fatalf("creating test CR: %v", err)
	}

	_, err := reconciler.Reconcile(context.Background(), ctrl.Request{NamespacedName: types.NamespacedName{Name: "pg-test", Namespace: "default"}})
	if err != nil {
		t.Fatalf("reconcile failed: %v", err)
	}

	// Assert StatefulSet exists
	ss := &appsv1.StatefulSet{}
	err = fakeClient.Get(context.Background(), client.ObjectKey{Name: "pg-test", Namespace: "default"}, ss)
	if err != nil {
		t.Fatalf("expected StatefulSet, got error: %v", err)
	}
}
```

- Integration tests – spin up a real API server using envtest (shipped with controller-runtime; this article uses v0.14.0). Run the full reconcile cycle against a temporary etcd.
- E2E tests – use kubectl-based scripts (or kuttl) that deploy the operator into a kind cluster, simulate failure scenarios (node loss, network partition), and assert that the custom resource's status.conditions reflect recovery.
Automate all three layers in GitHub Actions, caching the Go modules and Docker layers for speed.
Versioning CRDs and Managing Schema Evolution
CRD versioning follows the Kubernetes API versioning convention (alpha, beta, stable) rather than strict semantic versioning: a breaking schema change calls for a new API version (e.g., v1alpha1 → v1beta1). Strategies to keep upgrades smooth:
- Conversion webhook – implements ConvertTo/ConvertFrom to translate old objects to new schema on the fly.
- Webhook-less conversion (Kubebuilder 3.8) – if you keep changes additive (new optional fields), you can skip conversion entirely.
- Preserve unknown fields – set preserveUnknownFields: false to force explicit handling of new fields.
On nileshblog.tech we introduced a backupRetention field in v1beta1. The conversion webhook copied the old retentionDays value, ensuring existing clusters kept their backup policies without manual migration.
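The field rename described above can be sketched as a pair of conversion functions. The two spec structs are simplified, hypothetical versions of the real API types, both fields are assumed to hold a number of days, and the functions are plain Go rather than the controller-runtime conversion interfaces.

```go
package main

import "fmt"

// V1Alpha1Spec is a hypothetical old schema that stores retention as retentionDays.
type V1Alpha1Spec struct {
	Replicas      int
	RetentionDays int
}

// V1Beta1Spec is a hypothetical new schema that renames the field to backupRetention.
type V1Beta1Spec struct {
	Replicas        int
	BackupRetention int // assumed to be a number of days, like the old field
}

// convertTo upgrades an old object to the new schema, carrying retention over.
func convertTo(src V1Alpha1Spec) V1Beta1Spec {
	return V1Beta1Spec{
		Replicas:        src.Replicas,
		BackupRetention: src.RetentionDays,
	}
}

// convertFrom downgrades a new object so old clients can still read it.
func convertFrom(src V1Beta1Spec) V1Alpha1Spec {
	return V1Alpha1Spec{
		Replicas:      src.Replicas,
		RetentionDays: src.BackupRetention,
	}
}

func main() {
	old := V1Alpha1Spec{Replicas: 3, RetentionDays: 14}
	upgraded := convertTo(old)
	fmt.Printf("upgraded: %+v\n", upgraded)
	fmt.Printf("round trip equal: %v\n", convertFrom(upgraded) == old)
}
```

A round trip that returns the original object is the property worth testing: the API server may convert in both directions, so a lossy conversion silently drops data.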
Security Considerations: RBAC and Admission Webhooks
Operators often need cluster‑wide permissions (e.g., creating PVCs across namespaces). Craft a minimal RBAC policy:
# operator-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cassandra-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    # Explicit verbs instead of "*" keep the role minimal.
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["db.example.com"]
    resources: ["cassandraclusters"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cassandra-operator-binding
subjects:
  - kind: ServiceAccount
    name: cassandra-operator
    namespace: operators
roleRef:
  kind: ClusterRole
  name: cassandra-operator
  apiGroup: rbac.authorization.k8s.io
Add a validating admission webhook to enforce that a CassandraCluster never reduces replicas below the current quorum. The webhook rejects requests that would break data safety, reducing human error.
💡 Pro Tip: Register the webhook with failurePolicy: Fail so that an unreachable or misconfigured webhook blocks requests instead of letting unsafe objects slip through.
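The core check inside such a webhook can be very small. The sketch below assumes quorum means a simple majority of the current replica count, which is a simplification (in Cassandra, quorum depends on the replication factor); `quorum` and `validateReplicaChange` are hypothetical helpers and not part of any webhook library.

```go
package main

import "fmt"

// quorum returns the simple-majority size for n replicas.
func quorum(n int) int { return n/2 + 1 }

// validateReplicaChange rejects a scale-down that would leave fewer replicas
// than the quorum of the current cluster size. Scale-ups always pass.
func validateReplicaChange(current, desired int) error {
	if desired >= current {
		return nil
	}
	if q := quorum(current); desired < q {
		return fmt.Errorf("replicas %d would drop below quorum %d of the current %d-node cluster", desired, q, current)
	}
	return nil
}

func main() {
	for _, desired := range []int{7, 3, 2} {
		if err := validateReplicaChange(5, desired); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("5 -> %d allowed\n", desired)
	}
}
```

In a validating webhook the returned error would become the rejection message the user sees from kubectl, so it is worth making it specific.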
Common Errors & Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Operator constantly requeues with RequeueAfter: 0 | Reconcile loop does not set a terminal condition; status never reaches desired state. | Add explicit return ctrl.Result{RequeueAfter: time.Minute}, nil when waiting for external async work. |
| Pods created by the operator stay in Pending | PVCs bound to a storage class that lacks enough capacity. | Verify storage class allowVolumeExpansion and provision more capacity or adjust resources.requests.storage. |
| Backup CronJob fires but no snapshot appears in S3 | IAM role attached to the operator pod lacks s3:PutObject. | Grant the necessary permissions in the associated IAM policy and restart the operator. |
| CRD validation error after upgrading to v1beta1 | New required field missing in existing resources. | Use a conversion webhook to set default values for the new field. |
| Operator logs spam "rate limit exceeded" from the API server | Reconcile loop performs API calls without back-off. | Wrap the calls in a retry helper with exponential back-off (e.g., wait.ExponentialBackoff from k8s.io/apimachinery/pkg/util/wait, or retry.OnError from client-go). |
Frequently Asked Questions
When should I use a custom operator instead of a Helm chart?
Use Helm for simple, mostly stateless services where configuration can be expressed as a set of values.yaml entries. Choose a custom operator when the application needs state‑aware actions—coordinated upgrades, disaster‑recovery steps, or interaction with external systems—that cannot be captured in static manifests or Helm hooks.
What is the biggest challenge in writing a production‑grade operator?
Ensuring the idempotency and reliability of the reconciliation loop under every failure mode—network partitions, partial updates, or conflicting owner references. A non‑idempotent loop can cause endless thrashing, leading to data corruption.
Can operators work with existing Helm charts or deployments?
Yes. An operator can adopt resources created by Helm by adding an owner reference to the underlying objects. The operator then owns the lifecycle while Helm continues to supply the base manifests. Coordination is essential to avoid two controllers fighting over the same fields.
Author Bio:
I’m Nilesh Raut, a Software Development Engineer with 2+ years of experience, specializing in Go, JavaScript, Python, Docker, Kubernetes, Git, Jenkins, microservices, and system design (LLD/HLD), backed by a strong foundation in data structures and algorithms. Alongside my engineering journey, I bring 4+ years of hands‑on experience in SEO, where I’ve worked extensively on content strategy, keyword research, technical SEO, and organic growth, helping products and businesses scale efficiently by aligning solid technology with search‑driven performance.

