# etcd — Backup & Restore
Reference: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/
## What is etcd?
etcd is the brain of the Kubernetes cluster. It's a distributed key-value store that holds the entire cluster state — every object you've ever created via kubectl apply is stored here. Pods, deployments, services, configmaps, secrets, RBAC rules, everything.
When you run kubectl get pods, the API server reads from etcd. When you run kubectl apply -f, the API server writes to etcd. If etcd is gone, the cluster has no memory — it doesn't know what should be running.
That's why backing up etcd is, in effect, backing up the entire cluster.
## What is the Control Plane?
The control plane is the set of components that make Kubernetes work — the "management layer" of the cluster:
| Component | What it does |
|---|---|
| `kube-apiserver` | The front door — all kubectl commands hit this. Validates and persists to etcd |
| `etcd` | The state store — holds all cluster data |
| `kube-scheduler` | Watches for unscheduled pods and assigns them to nodes |
| `kube-controller-manager` | Runs all the control loops — ReplicaSet controller, Node controller, etc. |
The control plane runs on the controlplane node (sometimes called the master node). Worker nodes just run pods — they have no control plane components.
In kubeadm clusters, control plane components run as static pods in the kube-system namespace.
## What is kube-system?

kube-system is the namespace where Kubernetes runs its own internal components. When you do `kubectl get pods -n kube-system` you'll see:

```
etcd-controlplane
kube-apiserver-controlplane
kube-scheduler-controlplane
kube-controller-manager-controlplane
coredns-...
kube-proxy-...
```

These are the pods running the cluster's own control plane and networking. You don't manage them directly — the control plane pods are static pods, while coredns runs as a Deployment and kube-proxy as a DaemonSet.
## Static Pods — What They Are
Static pods are pods managed directly by the kubelet on a node, not by the Kubernetes API server. They're defined as YAML files in a specific directory on the node's filesystem. The kubelet watches that directory and runs whatever it finds there.
This is how the control plane bootstraps itself — the API server, etcd, and scheduler can't be managed by the API server (because it doesn't exist yet when the node starts up). So they're static pods managed by the kubelet directly.
Default static pod directory: /etc/kubernetes/manifests/
Any YAML file you drop in there gets run as a pod. Delete it and the pod stops. Kubelet watches this directory and reconciles constantly.
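As a sketch, a minimal manifest (hypothetical file name and image) dropped into that directory becomes a running pod without any API call:

```yaml
# /etc/kubernetes/manifests/static-web.yaml  (hypothetical example)
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
```

The kubelet suffixes the mirror pod with the node name — `static-web-<nodename>` — which is the same pattern behind names like etcd-controlplane.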
## Finding the Static Pod Path
You need to find where static pod manifests live. Don't memorise the path — derive it:
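The derivation is a single grep against the kubelet config — demonstrated here against a stand-in file, since on a real node you'd read `/var/lib/kubelet/config.yaml` itself:

```shell
# Stand-in for the node's kubelet config (contents assumed, kubeadm-style):
cat > /tmp/kubelet-config.yaml <<'EOF'
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
EOF

# On the node the command is: grep staticPodPath /var/lib/kubelet/config.yaml
grep staticPodPath /tmp/kubelet-config.yaml
```

The single matching line is the authoritative static pod directory for that node.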
Why this beats memorising /etc/kubernetes/manifests:
- The path is configurable. In the exam, the cluster might use a non-default path
- config.yaml is always there — it's the kubelet's configuration file
- One command gives you the authoritative answer regardless of cluster setup
If you blank on the config.yaml path, run `ps aux | grep kubelet`. This shows the kubelet process and all its flags. Look for `--config=/path/to/config.yaml` — that file contains `staticPodPath`. You can also look for `--pod-manifest-path` directly in the flags output.
| Method | Reliability |
|---|---|
| `cat /var/lib/kubelet/config.yaml \| grep staticPodPath` | Best — reads the actual config |
| `ps aux \| grep kubelet` | Fallback — shows kubelet flags |
| Assuming `/etc/kubernetes/manifests` | Works in default kubeadm clusters, but fragile |
## The Task

- SSH into the controlplane node
- Take a snapshot of etcd and save it to `/opt/cluster_backup.db`
- Restore from that snapshot to data directory `/root/default.etcd`
- Save the restore console output to `restore.txt`
## Step 1 — SSH to the Control Plane
etcd runs on the controlplane node. All etcd operations must be run there — etcd is not accessible from worker nodes.
In the exam, you start on a bastion/jump host. Tasks specify which node to ssh to.
## Step 2 — Find etcd's TLS Certificates
etcd requires TLS for all connections. To talk to it, you need:
- --cacert — the CA that signed the server's certificate (to verify you're talking to the real etcd)
- --cert — your client certificate (to authenticate yourself to etcd)
- --key — your client private key
Find them from the etcd static pod manifest:
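The grep, demonstrated against a trimmed stand-in manifest (on the node, point it at the real `/etc/kubernetes/manifests/etcd.yaml`; the flag values below are the kubeadm defaults, assumed for illustration):

```shell
# Trimmed stand-in for a kubeadm etcd.yaml (flag values assumed):
cat > /tmp/etcd.yaml <<'EOF'
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --listen-client-urls=https://127.0.0.1:2379
EOF

# Same command as on the node, just with the stand-in path:
grep -E "cert-file|key-file|trusted-ca-file|listen-client-urls" /tmp/etcd.yaml
```

Four lines come back — one per flag you need for the etcdctl command later.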
grep -E "pattern" — extended regex, lets you use | to match multiple patterns in one command.
What each flag in the etcd manifest means:
| Flag in manifest | What it is |
|---|---|
| `--trusted-ca-file` | The CA certificate — use this for `--cacert` |
| `--cert-file` | The server's TLS certificate — use this for `--cert` |
| `--key-file` | The server's TLS key — use this for `--key` |
| `--listen-client-urls` | The address etcd listens on — typically `https://127.0.0.1:2379` |
Standard paths in kubeadm clusters (but always verify from the manifest):
| Cert | Path |
|---|---|
| CA cert | /etc/kubernetes/pki/etcd/ca.crt |
| Server cert | /etc/kubernetes/pki/etcd/server.crt |
| Server key | /etc/kubernetes/pki/etcd/server.key |
## Step 3 — Take the Backup

```bash
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /opt/cluster_backup.db
```
Full breakdown, argument by argument:
| Part | What it does |
|---|---|
| `ETCDCTL_API=3` | Environment variable — selects the etcdctl v3 API. Required on older etcdctl, where v2 (a completely different data model) was the default; since etcd 3.4 v3 is the default, so this is harmless belt-and-braces. |
| `etcdctl` | The etcd command-line client |
| `--endpoints=https://127.0.0.1:2379` | Which etcd instance to connect to. `127.0.0.1:2379` = localhost, port 2379 (etcd's default client port). You're on the controlplane node, so localhost is correct. |
| `--cacert=...ca.crt` | CA certificate to verify etcd's TLS certificate. Without this, the TLS handshake fails — you can't connect. |
| `--cert=...server.crt` | Your client certificate — proves to etcd that you're authorised to connect. (kubeadm issues the etcd server cert with client-auth usage too, which is why reusing `server.crt` works.) |
| `--key=...server.key` | Your client private key — the private half of the client cert. |
| `snapshot save` | The command — take a point-in-time snapshot of all etcd data. |
| `/opt/cluster_backup.db` | Where to write the snapshot file. `.db` is conventional but any path works. |
Why all the TLS flags? etcd stores the entire cluster state including secrets. It's locked down with mutual TLS — both sides verify each other. No TLS flags = connection refused.
## Step 4 — Verify the Backup

Run `etcdutl snapshot status /opt/cluster_backup.db`.

`etcdutl` — a newer utility (separate from etcdctl) for operating on snapshot files offline. It does not connect to etcd — it reads the file directly.

`snapshot status` — prints metadata about the snapshot: hash, revision, total keys, total size. If the file is corrupt or incomplete, this will error.
Output looks like:
```
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 62a6e1b1 |     1234 |        987 |     3.2 MB |
+----------+----------+------------+------------+
```
Verify you see a non-zero revision and key count before proceeding.
## Step 5 — Restore the Backup

```bash
etcdutl snapshot restore /opt/cluster_backup.db \
  --data-dir /root/default.etcd \
  > restore.txt 2>&1
```
Full breakdown:
| Part | What it does |
|---|---|
| `etcdutl snapshot restore` | Restore a snapshot — does NOT connect to a live etcd. Works on files directly. |
| `/opt/cluster_backup.db` | Input — the snapshot file to restore from |
| `--data-dir /root/default.etcd` | Output — the directory where etcd's restored data will be written. A new etcd instance pointed at this directory will start up with the restored state. |
| `> restore.txt` | Redirect stdout to a file |
| `2>&1` | Redirect stderr to the same place as stdout (the file). Without this, only stdout goes to the file — error messages still print to the terminal and don't get captured. |
/opt/cluster_backup.db vs /root/default.etcd — not the same thing:
- cluster_backup.db = the snapshot file (a single binary file — the backup)
- default.etcd = the data directory (where etcd stores its WAL, snapshots, and state for normal operation)
The restore command reads the backup file and writes a fresh etcd data directory. You then point etcd at that new data directory.
2>&1 explained:
> redirects stdout (file descriptor 1). 2> redirects stderr (file descriptor 2). 2>&1 means "redirect stderr to wherever stdout currently points." Since stdout is already going to restore.txt, stderr goes there too. Result: both stdout and stderr end up in the file.
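A quick way to convince yourself (the file path here is just an example):

```shell
# One line to each stream, redirected the same way as the restore command:
{ echo "this went to stdout"; echo "this went to stderr" 1>&2; } > /tmp/both.txt 2>&1

# Both lines are in the file; nothing reached the terminal:
cat /tmp/both.txt
```

Drop the `2>&1` and re-run it: the stderr line prints to your terminal instead of landing in the file.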
## After Restoring — Update etcd to Use the New Data Dir
If you restore to a new directory (/root/default.etcd), you need to tell etcd to use it. Find the etcd static pod manifest and update --data-dir:
Find: `--data-dir=/var/lib/etcd` (the kubeadm default)

Change to: `--data-dir=/root/default.etcd`
Also update the hostPath volume mount that points to the data directory. Kubelet will automatically restart the etcd pod when it detects the manifest changed. The cluster will use the restored data.
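Sketched out, the two edits in etcd.yaml look like this (kubeadm defaults assumed; only the relevant fields shown):

```yaml
# /etc/kubernetes/manifests/etcd.yaml — relevant fragments only
spec:
  containers:
  - command:
    - etcd
    - --data-dir=/root/default.etcd    # was: /var/lib/etcd
  volumes:
  - hostPath:
      path: /root/default.etcd         # was: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
```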
## Command Summary

```bash
# 1. SSH to controlplane
ssh controlplane

# 2. Find cert paths (always do this — don't assume)
grep -E "cert-file|key-file|trusted-ca-file|listen-client-urls" /etc/kubernetes/manifests/etcd.yaml

# 3. Find static pod path (good habit)
cat /var/lib/kubelet/config.yaml | grep staticPodPath

# 4. Take snapshot
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /opt/cluster_backup.db

# 5. Verify backup
etcdutl snapshot status /opt/cluster_backup.db

# 6. Restore
etcdutl snapshot restore /opt/cluster_backup.db \
  --data-dir /root/default.etcd \
  > restore.txt 2>&1

# 7. Update etcd manifest to point at new data dir
vi /etc/kubernetes/manifests/etcd.yaml
# change --data-dir and the volume hostPath to /root/default.etcd
```
## etcdctl vs etcdutl

| Tool | What it's for |
|---|---|
| `etcdctl` | Interacts with a live running etcd cluster. Requires `--endpoints` and TLS certs. Used for: `snapshot save`, querying keys, cluster health. |
| `etcdutl` | Operates on snapshot files directly, offline. No connection needed. Used for: `snapshot restore`, `snapshot status`. |
In older versions, etcdctl was used for both. Current etcd (v3.5+) splits them. For the CKA:
- backup → etcdctl snapshot save (connects to live etcd)
- restore → etcdutl snapshot restore (reads file, no live connection)
- verify → etcdutl snapshot status (reads file, no live connection)