You’ve gone and done it: you oopsie-daisy’d some flag in your API server’s yaml, applied it, and now it’s in etcd. Your API server pods are crashlooping, and very quickly your dashboard and API go down. Nothing is accessible. You are screwed.
Or Maybe Not
You still have a chance! This has been tested on CoreOS’s Tectonic 1.8+, but might work elsewhere. SSH to any master node (you are running multiple masters, aren’t you?) and sudo to root. The steps: recover the API server manifest and secrets from etcd, fix the terrible mistake, start a recovery API pod using bootkube, then fix the real API configuration. Ready?
A few things are needed for this to work:
- etcd client TLS files: CA certificate, certificate, and key
- FQDNs of your etcd nodes
The etcd TLS files can be found in the assets.zip that Tectonic creates at install time. They might also already be on the master node, where the failing API pods are trying to use them.
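If you’re not sure where the files landed on the master, a quick search plus an expiry check can save a failed bootkube run. The /etc/kubernetes path here is a guess for a Tectonic install; adjust for your setup:

```shell
# Likely spot for the etcd client TLS files on a master (path is a guess; adjust as needed)
find /etc/kubernetes -name 'etcd-client*' 2>/dev/null

# Sanity-check that the client certificate hasn't expired before pointing bootkube at it
openssl x509 -in /path/to/etcd-client.crt -noout -subject -enddate
```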
Run this docker container, fixing the paths/domains for your setup:
$ docker run -it \
    -v /etc:/etc \
    -v /root/recovery:/recovery \
    quay.io/coreos/bootkube:v0.8.1 \
    /bootkube recover \
    --etcd-ca-path /path/to/etcd-client-ca.crt \
    --etcd-certificate-path /path/to/etcd-client.crt \
    --etcd-private-key-path /path/to/etcd-client.key \
    --recovery-dir /recovery/now \
    --etcd-servers=https://etcd0.com:2379,https://etcd1.com:2379,https://etcd2.com:2379 \
    --kubeconfig /etc/kubernetes/kubeconfig
The expected output looks like this:
Writing asset: /recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml
Writing asset: /recovery/now/tls/secrets/kube-apiserver/apiserver.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/apiserver.key
Writing asset: /recovery/now/tls/secrets/kube-apiserver/ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client-ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/etcd-client.key
Writing asset: /recovery/now/tls/secrets/kube-apiserver/oidc-ca.crt
Writing asset: /recovery/now/tls/secrets/kube-apiserver/service-account.pub
Writing asset: /recovery/now/tls/secrets/kube-cloud-cfg/config
Writing asset: /recovery/now/tls/secrets/kube-controller-manager/service-account.key
Writing asset: /recovery/now/tls/secrets/kube-controller-manager/ca.crt
Writing asset: /recovery/now/auth/kubeconfig
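If the output scrolled by too fast, listing the recovery directory (as bind-mounted from the host) confirms everything was written:

```shell
# /recovery inside the container is /root/recovery on the host
find /root/recovery/now -type f
```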
Now to fix everything. Edit /root/recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml and back out the change that ruined the day. Save it, and copy files to where they need to go:
$ cp -r /root/recovery/now/tls /etc/kubernetes/bootstrap-secrets
$ cp /root/recovery/now/bootstrap-manifests/bootstrap-kube-apiserver.yaml /etc/kubernetes/manifests
Here’s what happens: the kubelet sees the pod manifest in /etc/kubernetes/manifests and starts it up. There’s now a pod running a single, standalone API server. Use kubectl or the dashboard to edit the ACTUAL API server’s configuration and bounce the pods. Did they all come up?
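To watch the real API server pods come back, something like this works; the -l label selector is a guess for a Tectonic cluster, so drop it if it matches nothing:

```shell
# Watch the real API server pods restart (label selector is a guess; adjust for your cluster)
kubectl --kubeconfig=/etc/kubernetes/kubeconfig -n kube-system get pods -l k8s-app=kube-apiserver -w
```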
Last but not least, delete the recovery pod using kubectl or the dashboard, and maybe back up the /root/recovery directory in case there are more changes.
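A simple tarball is enough for that backup; the output path here is just a suggestion:

```shell
# Stash the recovered assets somewhere safe in case you need another round
tar czf /root/recovery-backup.tar.gz -C /root recovery
```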
Special thanks to Duffie Cooley, who might be a Kubernetes wizard.