You’ve gone and done it: you oopsie-daisy’d some flag in your API server’s yaml, applied it, and now it’s in etcd. Your API server pods are crashlooping, and very quickly your dashboard and API go down. Nothing is accessible. You are screwed.
Or Maybe Not
You still have a chance! This has been tested on CoreOS’s Tectonic 1.8+, but might work elsewhere. SSH to any master node (you are running multiple masters, aren’t you?), and sudo to root. The steps: recover the API server manifest and secrets from etcd, fix the terrible mistake, start up a recovery API pod generated by bootkube, then fix the real API configuration. Ready?
Recovery
A few things are needed for this to work:
- etcd client TLS files: CA certificate, client certificate, and client key
- FQDNs of your etcd nodes
The etcd TLS files can be found in the assets.zip that Tectonic makes at install time. They might also already be on the master node, where the failing API pods are trying to use them. An example:
/etc/kubernetes/checkpoint-secrets/kube-system/kube-apiserver-259ll/kube-apiserver/etcd-client-ca.crt
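Before handing those files to bootkube, it’s worth confirming you grabbed a matching pair. Here’s a small sketch (the `check_etcd_tls` helper is mine, not part of bootkube or Tectonic) that verifies the client certificate actually chains up to the CA you found:

```shell
# check_etcd_tls: verify that an etcd client certificate was signed by the
# given CA. Prints "<path>: OK" on success; anything else means you have
# mismatched files and bootkube recover will fail against etcd.
check_etcd_tls() {
  local ca="$1" crt="$2"
  openssl verify -CAfile "$ca" "$crt"
}

# Example invocation (substitute the real paths on your master):
# check_etcd_tls /path/to/etcd-client-ca.crt /path/to/etcd-client.crt
```

If this doesn’t print `OK`, go back and hunt for the right files before going any further.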
Run this docker container, fixing the paths/domains for your setup:
$ docker run -it \
-v /etc:/etc \
-v /root/recovery:/recovery \
quay.io/coreos/bootkube:v0.8.1 \
/bootkube recover \
--etcd-ca-path /path/to/etcd-client-ca.crt \
--etcd-certificate-path /path/to/etcd-client.crt \
--etcd-private-key-path /path/to/etcd-client.key \
--recovery-dir /recovery/now \
--etcd-servers=https://etcd0.com:2379,https://etcd1.com:2379,https://etcd2.com:2379 \
--kubeconfig /etc/kubernetes/kubeconfig
The expected output looks like this:
Writing asset: /recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml
Writing asset: /recovery/now/tls/secrets/kube-apiserver/apiserver.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/apiserver.key
Writing asset: /recovery/now/tls/secrets/kube-apiserver/ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client-ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client.key
Writing asset: /recovery/now/tls/secrets/kube-apiserver/oidc-ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/service-account.pub
Writing asset: /recovery/now/tls/secrets/kube-cloud-cfg/config
Writing asset: /recovery/now/tls/secrets/kube-controller-manager/service-account.key
Writing asset: /recovery/now/tls/secrets/kube-controller-manager/ca.crt
Writing asset: /recovery/now/auth/kubeconfig
Repair
Now to fix everything. Edit /root/recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml and back out the change that ruined the day. Save it, then copy the files where they need to go:
$ cp -r /root/recovery/now/tls /etc/kubernetes/bootstrap-secrets
$ cp /root/recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml /etc/kubernetes/manifests
Here’s what happens: the kubelet sees the pod manifest in /etc/kubernetes/manifests and starts it up. There’s now a pod running a single, standalone API server. Use kubectl or the dashboard to edit the ACTUAL API server’s configuration and bounce the pods. Did they all come up? Hooray!
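Rather than eyeballing `docker ps` to see whether the recovery API server came up, you can poll its health endpoint. A sketch, where the function name, the URL, and the retry count are all assumptions (check the bootstrap manifest for where your API server actually listens):

```shell
# wait_for_apiserver: poll <url>/healthz until it answers "ok" or we run out
# of tries. Returns 0 once the API server is healthy, 1 on timeout.
wait_for_apiserver() {
  local url="$1" tries="${2:-30}"
  local i
  for i in $(seq "$tries"); do
    # -k because the recovery server's cert likely won't match 127.0.0.1
    if curl -ks "$url/healthz" | grep -q ok; then
      echo "API server is up"
      return 0
    fi
    sleep 2
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Example (assumed address -- check your bootstrap-kube-apiserver.yaml):
# wait_for_apiserver https://127.0.0.1:443
```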
Cleanup
Last but not least, delete the recovery pod using kubectl or the dashboard, and maybe back up the /root/recovery directory in case there are more changes to make.
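That backup can be as simple as a tarball. A sketch (the helper name and destination path are mine, not anything Tectonic provides):

```shell
# backup_recovery_dir: tar up the recovery assets before deleting anything.
# Keep the tarball somewhere safe -- it contains API server secrets.
backup_recovery_dir() {
  local src="$1" dest="$2"
  tar czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example:
# backup_recovery_dir /root/recovery /root/recovery-backup.tar.gz
```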
Special Thanks
Special thanks to Duffie Cooley, who might be a Kubernetes wizard.