Fixing a Broken Kubernetes API Server

You’ve gone and done it: you oopsie-daisy’d some flag in your API server’s YAML, applied it, and now it’s in etcd. Your API server pods are crashlooping, and within moments your dashboard and API are down. Nothing is accessible. You are screwed.

Or Maybe Not

You still have a chance! This has been tested on CoreOS’s Tectonic 1.8+, but it might work elsewhere. SSH to any master node (you are running multiple masters, aren’t you?) and sudo to root. The plan: recover the API server manifest and secrets from etcd, fix the terrible mistake, start up a recovery API pod using bootkube, then fix the real API configuration. Ready?

Recovery

A few things are needed for this to work:

  • etcd client TLS files: CA certificate, certificate, and key
  • FQDNs of your etcd nodes

The etcd TLS files can be found in the assets.zip that Tectonic creates at install time. They might also already be on the master node, where the failing API pods are trying to use them. An example:

/etc/kubernetes/checkpoint-secrets/kube-system/kube-apiserver-259ll/kube-apiserver/etcd-client-ca.crt
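
If you want to be sure the files you’ve found actually work against etcd before going further, a quick curl does the trick (paths and FQDN here are placeholders; a healthy etcd member answers on /health):

$ curl --cacert /path/to/etcd-client-ca.crt \
    --cert /path/to/etcd-client.crt \
    --key /path/to/etcd-client.key \
    https://etcd0.com:2379/health
{"health": "true"}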

Run this Docker container, adjusting the paths and domains for your setup:

$ docker run -it \
    -v /etc:/etc \
    -v /root/recovery:/recovery \
    quay.io/coreos/bootkube:v0.8.1 \
    /bootkube recover \
    --etcd-ca-path /path/to/etcd-client-ca.crt \
    --etcd-certificate-path /path/to/etcd-client.crt \
    --etcd-private-key-path /path/to/etcd-client.key \
    --recovery-dir /recovery/now \
    --etcd-servers=https://etcd0.com:2379,https://etcd1.com:2379,https://etcd2.com:2379 \
    --kubeconfig /etc/kubernetes/kubeconfig

The expected output looks like this:

Writing asset: /recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml
Writing asset: /recovery/now/tls/secrets/kube-apiserver/apiserver.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/apiserver.key
Writing asset: /recovery/now/tls/secrets/kube-apiserver/ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client-ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client.key
Writing asset: /recovery/now/tls/secrets/kube-apiserver/oidc-ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/service-account.pub
Writing asset: /recovery/now/tls/secrets/kube-cloud-cfg/config
Writing asset: /recovery/now/tls/secrets/kube-controller-manager/service-account.key
Writing asset: /recovery/now/tls/secrets/kube-controller-manager/ca.crt
Writing asset: /recovery/now/auth/kubeconfig
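
Thanks to the -v /root/recovery:/recovery mount, those assets land on the host under /root/recovery/now:

$ ls /root/recovery/now
auth  bootstrap-manifests  tls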

Repair

Now to fix everything. Edit /root/recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml and back out the change that ruined the day. Save it, then copy the files where they need to go:

$ cp -r /root/recovery/now/tls /etc/kubernetes/bootstrap-secrets
$ cp /root/recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml /etc/kubernetes/manifests

Here’s what happens: the kubelet sees the pod manifest in /etc/kubernetes/manifests and starts it up. There’s now a pod running a single, standalone API server. Use kubectl or the dashboard to edit the ACTUAL API server’s configuration and bounce the pods; a kubectl sketch follows.
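
A minimal sketch of that edit-and-bounce, assuming the self-hosted API server runs as a kube-apiserver daemonset in kube-system carrying a k8s-app=kube-apiserver label (adjust names and selectors for your cluster):

$ export KUBECONFIG=/etc/kubernetes/kubeconfig
$ kubectl -n kube-system edit daemonset kube-apiserver    # back out the bad flag here too
$ kubectl -n kube-system delete pods -l k8s-app=kube-apiserver
$ kubectl -n kube-system get pods -w

Did they all come up? Hooray!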

Cleanup

Last but not least, get rid of the recovery pod. Since it’s a static pod, deleting it with kubectl or the dashboard won’t stick; the kubelet will just restart it. Remove its manifest from /etc/kubernetes/manifests instead, and maybe back up the /root/recovery directory in case there are more changes to make.
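
Concretely, that cleanup might look something like this (the tarball name is just a suggestion):

$ rm /etc/kubernetes/manifests/bootstrap-kube-apiserver.yaml
$ rm -r /etc/kubernetes/bootstrap-secrets
$ tar czf /root/recovery-backup.tar.gz -C /root recovery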

Special Thanks

Special thanks to Duffie Cooley, who might be a Kubernetes wizard.