Reboot nodes

How to properly reboot a node in an EKS Anywhere cluster

If you need to reboot a node in your cluster for maintenance or any other reason, follow these steps to avoid disrupting the services running on it:

  Before you begin, on your admin machine, set the following environment variables, which are used in later steps:

    export CLUSTER_NAME=mgmt
    export MGMT_KUBECONFIG=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig

  1. Back up the cluster.

    This ensures that an up-to-date copy of the cluster state is available for restoration in case the cluster experiences issues or becomes unrecoverable during the reboot.

  2. Verify that the DHCP lease time is longer than the planned maintenance time, and that node IP addresses will be the same before and after maintenance.

    This step is critical to ensuring the cluster is healthy after the reboot. If IP addresses are not preserved across the reboot, the cluster may not be recoverable.
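One way to check that addresses were preserved is to record each node's internal IP before maintenance and compare it afterwards. The sketch below is only an illustration and not part of the EKS Anywhere tooling: the function names `snapshot_node_ips` and `compare_node_ips` and the file paths are hypothetical, and it assumes `kubectl` is already pointed at the cluster being rebooted.

```shell
# Hypothetical helpers (not EKS Anywhere commands): record and compare node IPs.

# Write sorted "<node-name> <internal-ip>" lines to the file given as $1.
snapshot_node_ips() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
    | sort > "$1"
}

# Compare two snapshot files. Returns 0 when they are identical;
# otherwise prints the differing lines and returns non-zero.
compare_node_ips() {
  diff "$1" "$2"
}
```

For example, run `snapshot_node_ips /tmp/ips-before.txt` before shutting anything down, `snapshot_node_ips /tmp/ips-after.txt` once all nodes are back, and then `compare_node_ips /tmp/ips-before.txt /tmp/ips-after.txt`.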

  3. Pause the reconciliation of the cluster being shut down.

    This ensures that the EKS Anywhere cluster controller does not try to reconcile or remediate the nodes while they are down.

    • Add the paused annotation to the EKS Anywhere cluster and set spec.paused to true on the CAPI cluster:
    kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused=true --kubeconfig=$MGMT_KUBECONFIG
    
    kubectl patch clusters.cluster.x-k8s.io $CLUSTER_NAME --type merge -p '{"spec":{"paused": true}}' -n eksa-system --kubeconfig=$MGMT_KUBECONFIG
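Before shutting anything down, it can be worth confirming that both changes took effect. The following is a minimal sketch under stated assumptions: `cluster_is_paused` is a hypothetical helper name, and it reads back the same annotation and `spec.paused` field that the two commands above set.

```shell
# Hypothetical helper: succeeds only if the EKS Anywhere cluster carries the
# paused annotation and the CAPI cluster has spec.paused set to true.
# Usage: cluster_is_paused <cluster-name> <kubeconfig-path>
cluster_is_paused() {
  name="$1"
  kubeconfig="$2"
  # Dots inside the annotation key must be escaped in kubectl JSONPath.
  annotation=$(kubectl get clusters.anywhere.eks.amazonaws.com "$name" \
    -o jsonpath='{.metadata.annotations.anywhere\.eks\.amazonaws\.com/paused}' \
    --kubeconfig="$kubeconfig")
  capi_paused=$(kubectl get clusters.cluster.x-k8s.io "$name" -n eksa-system \
    -o jsonpath='{.spec.paused}' --kubeconfig="$kubeconfig")
  [ "$annotation" = "true" ] && [ "$capi_paused" = "true" ]
}
```

A call such as `cluster_is_paused "$CLUSTER_NAME" "$MGMT_KUBECONFIG" && echo paused` prints `paused` only when both checks pass.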
    
  4. For each node in the cluster, perform the following steps. Work through the nodes in this order: worker nodes, control plane nodes, then etcd nodes.

    1. Cordon the node so no further workloads are scheduled to run on it:

      kubectl cordon <node-name>
      
    2. Drain the node of all current workloads:

      kubectl drain <node-name>
      
    3. Using the appropriate method for your provider, shut down the node.
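When there are many worker nodes, the cordon and drain steps can be scripted. The sketch below is an assumption-laden illustration rather than a recommended procedure: `cordon_and_drain` is a hypothetical helper, and the `--ignore-daemonsets`, `--delete-emptydir-data`, and `--timeout` flags are standard `kubectl drain` options that are often needed in practice but are not part of the steps above. Whether they are appropriate depends on your workloads.

```shell
# Hypothetical helper: cordon and drain every node name passed as an argument.
# Stops at the first failure so a stuck drain is noticed before any shutdown.
cordon_and_drain() {
  for node in "$@"; do
    kubectl cordon "$node" || return 1
    # Assumed flags: adjust or remove them to match your workloads.
    kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=300s || return 1
  done
}
```

For example, `cordon_and_drain worker-1 worker-2` processes two worker nodes in sequence (the node names here are placeholders). Shutting each node down is still done with your provider's own tooling.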

  5. Perform the system maintenance or other tasks required on each node. Then boot the nodes in this order: etcd nodes, control plane nodes, then worker nodes.

  6. Uncordon the nodes so that they can begin receiving workloads again.

    kubectl uncordon <node-name>
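A node that has just booted may take some time to rejoin the cluster, so it can help to wait until it reports Ready before uncordoning it. This is a minimal sketch, assuming the standard `kubectl wait` command; `uncordon_when_ready` is a hypothetical helper name and the ten-minute timeout is an arbitrary choice.

```shell
# Hypothetical helper: wait for each named node to report Ready, then uncordon it.
# Returns non-zero if a node does not become Ready within the timeout.
uncordon_when_ready() {
  for node in "$@"; do
    kubectl wait --for=condition=Ready "node/$node" --timeout=600s || return 1
    kubectl uncordon "$node" || return 1
  done
}
```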
    
  7. Remove the paused annotation from the EKS Anywhere cluster and set spec.paused back to false on the CAPI cluster:

    kubectl annotate clusters.anywhere.eks.amazonaws.com $CLUSTER_NAME anywhere.eks.amazonaws.com/paused- --kubeconfig=$MGMT_KUBECONFIG
    
    kubectl patch clusters.cluster.x-k8s.io $CLUSTER_NAME --type merge -p '{"spec":{"paused": false}}' -n eksa-system --kubeconfig=$MGMT_KUBECONFIG
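Once reconciliation has resumed, a quick final check is to list any node that is not in a plain Ready state. The sketch below is illustrative only: `not_ready_nodes` is a hypothetical helper that reads the STATUS column of `kubectl get nodes`, so it also reports nodes that are still cordoned (shown as `Ready,SchedulingDisabled`).

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
# Usage: not_ready_nodes <kubeconfig-path>
not_ready_nodes() {
  kubectl get nodes --no-headers --kubeconfig="$1" \
    | awk '$2 != "Ready" { print $1 }'
}
```

Empty output from `not_ready_nodes "$MGMT_KUBECONFIG"` suggests that every node is back and schedulable.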