Module 09 — Bootstrap the etcd Cluster

etcd is a distributed key-value store that holds the entire Kubernetes cluster state — every pod, service, deployment, secret, and config map. When you run kubectl get pods, the API server reads from etcd. When you create a deployment, the API server writes to etcd. If etcd is down, the API server cannot function.

In this module you install etcd on both control plane nodes (cp1, cp2), configure TLS for both peer communication (etcd-to-etcd) and client access (API server-to-etcd), and verify the cluster is healthy.


1. Why a Cluster, Not a Single Node

A single etcd instance is a single point of failure. If it dies, the API server loses its backing store and stops serving requests. Clustering provides:

  • Replication — data is replicated across members
  • Consensus — writes require agreement from a quorum (majority of members)
  • Availability — the cluster continues serving if a minority of members fail

Quorum and the 2-node trade-off

Cluster size   Quorum   Tolerates failures
1              1        0 (single point of failure)
2              2        0 (both must be up for writes)
3              2        1
5              3        2

Your 2-node cluster requires both members to be up for writes. If one node goes down, the cluster becomes read-only. This is acceptable for training — you learn the clustering mechanics — but production clusters use 3 or 5 members.
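The quorum column is plain arithmetic: a majority of n members is floor(n/2) + 1, and the cluster tolerates n minus quorum failures. A quick shell sketch reproduces the table:

```shell
# Quorum math: a majority is floor(n/2) + 1; the cluster survives
# losing (n - quorum) members. Reproduces the table above.
for n in 1 2 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "size=$n quorum=$quorum tolerates=$(( n - quorum ))"
done
```

Note that even sizes buy nothing: a 4-member cluster has quorum 3 and still tolerates only 1 failure, which is why production clusters use odd sizes.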

Why not 3 in this training? Running 3 control plane VMs would push the total to 6 VMs and ~12 GB RAM. Two control planes teach the same concepts with less resource overhead.


2. Download and Install etcd

Run these steps on both cp1 and cp2. SSH into each node:

ssh cp1

2.1 Download etcd

ETCD_VER=v3.5.16

curl -sL "https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz" -o /tmp/etcd.tar.gz

2.2 Extract and install

tar -xzf /tmp/etcd.tar.gz -C /tmp/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcd /usr/local/bin/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcdctl /usr/local/bin/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcdutl /usr/local/bin/
rm -rf /tmp/etcd*

2.3 Verify

etcd --version
etcdctl version

Expected:

etcd Version: 3.5.16
...
etcdctl version: 3.5.16

Repeat these steps on cp2 before continuing.

Checkpoint: etcd --version returns 3.5.16 on both cp1 and cp2.


3. Prepare the Configuration

Run these steps on both cp1 and cp2.

3.1 Create directories

sudo mkdir -p /etc/etcd /var/lib/etcd
sudo chmod 700 /var/lib/etcd

  • /etc/etcd — TLS certificates and configuration
  • /var/lib/etcd — data directory (where etcd stores its database). The restrictive permissions prevent other users from reading cluster data.
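To double-check the data directory, a tiny helper can compare a path's mode against what you expect (a sketch; check_mode is a hypothetical name, and stat -c '%a' is GNU coreutils syntax):

```shell
# check_mode (hypothetical helper): compare a path's octal mode against
# an expected value. 'stat -c %a' prints the mode; GNU coreutils syntax.
check_mode() {
  local got
  got=$(stat -c '%a' "$1" 2>/dev/null)
  if [ "$got" = "$2" ]; then
    echo "ok: $1 is mode $2"
  else
    echo "WRONG: $1 is mode ${got:-missing}, expected $2"
  fi
}

check_mode /var/lib/etcd 700
```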

3.2 Copy certificates

The certificates were distributed to ~/ in Module 07. Copy them to /etc/etcd:

sudo cp ~/ca.pem ~/etcd.pem ~/etcd-key.pem /etc/etcd/

3.3 Set environment variables

Each node needs to know its own name and IP. These variables are expanded into the systemd unit file when you create it in the next section, so set them in the same shell session.

On cp1:

ETCD_NAME=cp1
INTERNAL_IP=192.168.56.21

On cp2:

ETCD_NAME=cp2
INTERNAL_IP=192.168.56.22
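Typing these by hand works, but mixing them up is a classic error. A small sketch keyed on the hostname sets the right pair on either node (node_vars is a hypothetical helper; it assumes the hostnames are exactly cp1 and cp2):

```shell
# node_vars (hypothetical helper): map a node name to the variable
# assignments it needs. Names and IPs are the ones used in this track.
node_vars() {
  case "$1" in
    cp1) echo "ETCD_NAME=cp1 INTERNAL_IP=192.168.56.21" ;;
    cp2) echo "ETCD_NAME=cp2 INTERNAL_IP=192.168.56.22" ;;
    *)   echo "unknown node: $1" >&2; return 1 ;;
  esac
}

# Set the variables for whichever node you are on:
eval "$(node_vars "$(hostname -s)")"
```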

Checkpoint: Certificates exist in /etc/etcd/ on both nodes.


4. Create the systemd Unit File

This is the most flag-heavy configuration in the entire track. Each flag has a specific purpose — read through the explanations before creating the file.

Run on each node (with the correct ETCD_NAME and INTERNAL_IP variables set from Section 3.3):

cat <<EOF | sudo tee /etc/systemd/system/etcd.service
[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \\
--name ${ETCD_NAME} \\
--cert-file=/etc/etcd/etcd.pem \\
--key-file=/etc/etcd/etcd-key.pem \\
--peer-cert-file=/etc/etcd/etcd.pem \\
--peer-key-file=/etc/etcd/etcd-key.pem \\
--trusted-ca-file=/etc/etcd/ca.pem \\
--peer-trusted-ca-file=/etc/etcd/ca.pem \\
--peer-client-cert-auth \\
--client-cert-auth \\
--initial-advertise-peer-urls https://${INTERNAL_IP}:2380 \\
--listen-peer-urls https://${INTERNAL_IP}:2380 \\
--listen-client-urls https://${INTERNAL_IP}:2379,https://127.0.0.1:2379 \\
--advertise-client-urls https://${INTERNAL_IP}:2379 \\
--initial-cluster-token etcd-cluster-0 \\
--initial-cluster cp1=https://192.168.56.21:2380,cp2=https://192.168.56.22:2380 \\
--initial-cluster-state new \\
--data-dir=/var/lib/etcd
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
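One failure mode worth catching here: if ETCD_NAME or INTERNAL_IP was unset when the heredoc ran, the variables expand to nothing and the unit file is created with blank values rather than an error. A small sketch detects the empty --name case (check_unit is a hypothetical helper):

```shell
# check_unit (hypothetical helper): flag a unit file whose --name flag
# expanded to nothing, i.e. the line reads "--name \" with no value.
check_unit() {
  local unit=$1
  [ -f "$unit" ] || { echo "missing: $unit"; return 1; }
  if grep -qE -- '--name[[:space:]]*\\$' "$unit"; then
    echo "WARNING: --name is empty - recreate the unit with ETCD_NAME set"
  else
    echo "looks expanded: $(grep -oE -- '--name [^ ]+' "$unit")"
  fi
}

# On a node:
#   check_unit /etc/systemd/system/etcd.service
```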

What each flag does

TLS flags:

Flag                       Purpose
--cert-file                Server certificate for client connections
--key-file                 Private key for the server certificate
--peer-cert-file           Certificate for peer-to-peer connections
--peer-key-file            Private key for the peer certificate
--trusted-ca-file          CA certificate used to verify client certificates
--peer-trusted-ca-file     CA certificate used to verify peer certificates
--client-cert-auth         Require clients to present a valid certificate
--peer-client-cert-auth    Require peers to present a valid certificate

Clustering flags:

Flag                           Purpose
--name                         This member's name (must be unique in the cluster)
--initial-advertise-peer-urls  URL this member advertises to peers
--listen-peer-urls             Address to listen on for peer connections (port 2380)
--listen-client-urls           Addresses to listen on for client connections (port 2379)
--advertise-client-urls        URL clients should use to connect to this member
--initial-cluster              All members and their peer URLs (bootstrap config)
--initial-cluster-token        Shared token to prevent cross-cluster connections
--initial-cluster-state        new for first-time bootstrap
--data-dir                     Where to store the etcd database on disk

Port assignments

Port   Protocol   Used for
2379   HTTPS      Client connections (API server → etcd)
2380   HTTPS      Peer connections (etcd node → etcd node)

Checkpoint: /etc/systemd/system/etcd.service exists on both cp1 and cp2.


5. Start the etcd Cluster

Both members must start in close succession because the cluster waits for quorum during initial bootstrap.

5.1 Start etcd on both nodes

On cp1:

sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd

On cp2 (immediately after):

sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd

The first node to start will wait for the second member to join. Once both are up, the cluster forms and elects a leader.

5.2 Check the service status

On each node:

sudo systemctl status etcd

Expected: Active: active (running).

If the service is not running, check the logs:

sudo journalctl -u etcd --no-pager -l | tail -30

Common issues at this stage:

  • Certificate file not found → verify /etc/etcd/*.pem exists
  • Connection refused on peer port → the other node is not started yet
  • Bind address in use → another process is on port 2379 or 2380

Checkpoint: sudo systemctl status etcd shows active (running) on both cp1 and cp2.


6. Verify Cluster Health

6.1 Member list

From either cp1 or cp2:

sudo ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem \
--write-out=table

Expected output:

+------------------+---------+------+----------------------------+----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+------+----------------------------+----------------------------+------------+
| xxxxxxxxxxxxxxxx | started | cp1  | https://192.168.56.21:2380 | https://192.168.56.21:2379 | false      |
| yyyyyyyyyyyyyyyy | started | cp2  | https://192.168.56.22:2380 | https://192.168.56.22:2379 | false      |
+------------------+---------+------+----------------------------+----------------------------+------------+

Both members show started status.

6.2 Endpoint health

sudo ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://192.168.56.21:2379,https://192.168.56.22:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem \
--write-out=table

Expected output:

+----------------------------+--------+-------------+-------+
|          ENDPOINT          | HEALTH |    TOOK     | ERROR |
+----------------------------+--------+-------------+-------+
| https://192.168.56.21:2379 |  true  | 10.123ms    |       |
| https://192.168.56.22:2379 |  true  | 11.456ms    |       |
+----------------------------+--------+-------------+-------+

Both endpoints are healthy.

6.3 Endpoint status (shows leader)

sudo ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://192.168.56.21:2379,https://192.168.56.22:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem \
--write-out=table

One of the members will show true in the IS LEADER column. The other shows false. The leader handles all write operations and replicates them to the follower.

Checkpoint: etcdctl member list shows 2 members with started status. etcdctl endpoint health shows both healthy.


7. Create a Helper Alias

The etcdctl commands are verbose because of the TLS flags. Create an alias to simplify:

On both cp1 and cp2:

cat >> ~/.bashrc <<'EOF'

# etcdctl with TLS flags
alias etcdctl='sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd/ca.pem \
--cert=/etc/etcd/etcd.pem \
--key=/etc/etcd/etcd-key.pem'
EOF

source ~/.bashrc

Now you can run:

etcdctl member list --write-out=table

Much cleaner.


8. Test Data Operations

etcd is a key-value store. Test it by writing and reading data:

8.1 Write a key

On cp1:

etcdctl put /test/greeting "Hello from etcd"
OK

8.2 Read the key

On cp2 (proves replication):

etcdctl get /test/greeting
/test/greeting
Hello from etcd

The data written on cp1 is immediately available on cp2.

8.3 List all keys under a prefix

etcdctl get /test --prefix --keys-only
/test/greeting

8.4 Delete the test key

etcdctl del /test/greeting
1

8.5 Preview what Kubernetes will store

When the API server connects, it will store data under the /registry/ prefix. After the cluster is running (Module 10+), you can inspect Kubernetes data directly:

# This will work after Module 10 — shown here for context
etcdctl get /registry --prefix --keys-only | head -20

You will see paths like /registry/pods/default/..., /registry/services/..., etc. — the entire cluster state.

Checkpoint: A key written on cp1 is readable on cp2. Data replication works.


9. Understand the Data Directory

etcd stores its database in /var/lib/etcd. Take a look:

sudo ls -la /var/lib/etcd/member/
drwx------ snap
drwx------ wal

Directory   Purpose
wal/        Write-ahead log — records every change before applying it. Used for crash recovery.
snap/       Snapshots — periodic full copies of the database. Used for faster recovery and new member bootstrapping.

Important: Never modify files in /var/lib/etcd directly. Always use etcdctl or the etcd API.

Backup

To back up the etcd database:

etcdctl snapshot save /tmp/etcd-backup.db
Snapshot saved at /tmp/etcd-backup.db

Verify the backup. In etcd 3.5, offline snapshot inspection lives in etcdutl (installed in Section 2.2); it reads the file directly, so no TLS flags are needed:

sudo etcdutl snapshot status /tmp/etcd-backup.db --write-out=table

This shows the snapshot revision, total key count, and size. In a production cluster, you would run this backup on a cron schedule and store the snapshots off-node.
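As a sketch, such a schedule could be a cron entry like the one below. The file name /etc/cron.d/etcd-backup, the 6-hour cadence, and the /var/backups destination are all assumptions; the TLS flags are spelled out because the Section 7 alias is not available to cron, and % is escaped as crontab syntax requires:

```
# /etc/cron.d/etcd-backup (hypothetical): snapshot every 6 hours.
# Cron lines cannot span lines, and % must be escaped as \%.
0 */6 * * * root ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/etcd.pem --key=/etc/etcd/etcd-key.pem snapshot save /var/backups/etcd-$(date +\%F).db
```

In practice you would also prune old snapshots and copy them off-node, since a backup stored on the failed node helps nobody.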


10. Troubleshooting

context deadline exceeded when starting etcd

The other member is not reachable on port 2380. Check:

  1. The other node is started: ssh cp2 "sudo systemctl status etcd"
  2. Network connectivity: ping 192.168.56.22
  3. Port is open: ssh cp2 "ss -tlnp | grep 2380"

Start both nodes within a few seconds of each other. etcd waits up to ~30 seconds for the cluster to form.
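Those checks can be rolled into one loop runnable from either node (a sketch; check_port is a hypothetical helper built on bash's /dev/tcp redirection, with timeout capping each probe at 2 seconds):

```shell
# check_port (hypothetical helper): probe TCP reachability with bash's
# built-in /dev/tcp redirection; timeout caps each attempt at 2 seconds.
check_port() {
  if timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "open   $1:$2"
  else
    echo "closed $1:$2"
  fi
}

for host in 192.168.56.21 192.168.56.22; do
  for port in 2379 2380; do
    check_port "$host" "$port"
  done
done
```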

x509: certificate is valid for X, not Y

The etcd certificate's SANs do not include the IP or hostname being used. Verify with:

openssl x509 -in /etc/etcd/etcd.pem -text -noout | grep -A 1 "Subject Alternative"

The SANs must include 192.168.56.21, 192.168.56.22, cp1, cp2, and 127.0.0.1. If any are missing, regenerate the etcd certificate in Module 07.
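The same check can be scripted so nothing on the list is missed (a sketch; check_sans is a hypothetical helper, and the -ext option requires OpenSSL 1.1.1 or newer):

```shell
# check_sans (hypothetical helper): confirm every SAN this cluster needs
# is present in a certificate. '-ext subjectAltName' needs OpenSSL >= 1.1.1.
check_sans() {
  local sans want
  sans=$(openssl x509 -in "$1" -noout -ext subjectAltName 2>/dev/null)
  for want in 192.168.56.21 192.168.56.22 cp1 cp2 127.0.0.1; do
    case "$sans" in
      *"$want"*) echo "ok:      $want" ;;
      *)         echo "MISSING: $want" ;;
    esac
  done
}

check_sans /etc/etcd/etcd.pem
```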

permission denied on data directory

The data directory must be owned by root with 700 permissions:

sudo chown -R root:root /var/lib/etcd
sudo chmod 700 /var/lib/etcd

member already bootstrapped

etcd refuses to start because the data directory has data from a previous bootstrap attempt. If you need to start fresh:

sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*
sudo systemctl start etcd

Do this on both nodes, then start both within a few seconds of each other so the fresh bootstrap can form a quorum.

etcdctl returns Error: context deadline exceeded

The TLS flags are wrong or missing. Verify you are using --cacert, --cert, and --key with the correct file paths, or use the alias from Section 7.


11. What You Have Now

Capability                         Verification command
etcd installed on cp1 and cp2      ssh cp1 "etcd --version"
2-member cluster formed            etcdctl member list --write-out=table — 2 members
Peer TLS encryption                Peer URLs use https://
Client TLS encryption              Client URLs use https://
Client certificate auth required   --client-cert-auth flag in unit file
Data replication across members    Write on cp1, read on cp2
Cluster healthy                    etcdctl endpoint health — both true
Backup capability                  etcdctl snapshot save /tmp/etcd-backup.db

The data store is running and healthy. The API server (Module 10) will connect to etcd using the certificates from Module 07 and store all Kubernetes cluster state here.


Next up: Module 10 — Bootstrap the Control Plane — install kube-apiserver, kube-controller-manager, and kube-scheduler on both control plane nodes.