Kubernetes Container Health Checks and Graceful Termination

2024-07-03

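Enabling container health checks and graceful termination in Kubernetes, and tuning them to the application's own characteristics, improves application stability in production and reduces release incidents and false alarms.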

Parameters

Practice configuration

An example Kubernetes Deployment with container health checks and graceful termination enabled:

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Default: 30
      terminationGracePeriodSeconds: 120
      imagePullSecrets:
        - name: mysecret
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          startupProbe:
            tcpSocket:
              port: 8080
            # Default: 0
            initialDelaySeconds: 30
            # Default: 10
            periodSeconds: 30
            # Default: 3
            failureThreshold: 10
            # Default: 1; must be 1 for startup probes
            successThreshold: 1
            # Default: 1
            timeoutSeconds: 2
          livenessProbe:
            tcpSocket:
              port: 8080
            # Default: 0
            initialDelaySeconds: 30
            # Default: 10
            periodSeconds: 30
            # Default: 3
            failureThreshold: 3
            # Default: 1; must be 1 for liveness probes
            successThreshold: 1
            # Default: 1
            timeoutSeconds: 2
          readinessProbe:
            tcpSocket:
              port: 8080
            # Default: 0
            initialDelaySeconds: 30
            # Default: 10
            periodSeconds: 30
            # Default: 3
            failureThreshold: 3
            # Default: 1
            successThreshold: 2
            # Default: 1
            timeoutSeconds: 2
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM is sent; the sleep keeps the app serving while traffic drains
                command: ["/bin/sh", "-c", "sleep 60"]
          env:
            - name: TZ
              value: Asia/Shanghai
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 500m
              memory: 1Gi

Practice configuration in detail

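Kubernetes defaults:

Startup check: none
Container online: at least 0 seconds
Container status:
  Failure detection: 23-33 seconds. The lower bound is failureThreshold(3) * timeoutSeconds(1) + ( failureThreshold(3) - 1 ) * periodSeconds(10) = 23; the upper bound adds one more periodSeconds(10), because the fault can occur right after a successful probe
  Recovery detection: 0-10 seconds, at most one periodSeconds(10)
Container shutdown: at least 0 seconds, at most 30 seconds, terminationGracePeriodSeconds(30)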
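
The arithmetic behind the failure and recovery windows can be checked with a short Python sketch. The function names are hypothetical, and the formulas follow this article's simplified model (every failing probe uses its full timeout, and the fault may occur up to one period before the first failing probe), not the kubelet's exact scheduling.

```python
def failure_window(period, timeout, failure_threshold):
    """Return (shortest, longest) seconds until a failing container is flagged."""
    shortest = failure_threshold * timeout + (failure_threshold - 1) * period
    # The fault may happen right after a successful probe, adding up to one period.
    longest = shortest + period
    return shortest, longest


def recovery_window(period, success_threshold):
    """Return (shortest, longest) seconds until a recovered container is flagged healthy."""
    shortest = (success_threshold - 1) * period
    longest = shortest + period
    return shortest, longest


# Kubernetes defaults: periodSeconds=10, timeoutSeconds=1,
# failureThreshold=3, successThreshold=1
print(failure_window(10, 1, 3))   # (23, 33)
print(recovery_window(10, 1))     # (0, 10)
```

Calling the same functions with periodSeconds=30, timeoutSeconds=2, failureThreshold=3 and successThreshold=2, the values from the practice configuration, gives (66, 96) and (30, 60).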

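Practice configuration:

Startup check:
  At least 30 seconds: initialDelaySeconds(30)
  At most 320 seconds: initialDelaySeconds(30) + failureThreshold(10) * timeoutSeconds(2) + ( failureThreshold(10) - 1 ) * periodSeconds(30)
  Note: by design, startupProbe.successThreshold can only be set to 1
Container online:
  At least 90 seconds: startup check (at least 30 seconds) + initialDelaySeconds(30) + periodSeconds(30) * ( readinessProbe.successThreshold(2) - 1 )
  Note: by design, livenessProbe.successThreshold can only be set to 1
Container status:
  Failure detection: 66-96 seconds. The lower bound is failureThreshold(3) * timeoutSeconds(2) + ( failureThreshold(3) - 1 ) * periodSeconds(30) = 66; the upper bound adds one more periodSeconds(30)
  Recovery detection: 30-60 seconds. The lower bound is periodSeconds(30) * ( successThreshold(2) - 1 ) = 30; the upper bound adds one more periodSeconds(30)
Container shutdown:
  At least 60 seconds: sleep 60
  At most 120 seconds: terminationGracePeriodSeconds(120)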
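
The startup, online and shutdown figures above can be reproduced with a small Python sketch. All function names are hypothetical and the formulas mirror this article's model. The last helper additionally assumes the documented Kubernetes behaviour that the preStop hook runs inside the termination grace period, so the time it consumes is no longer available to the application after SIGTERM.

```python
def startup_window(initial_delay, period, timeout, failure_threshold):
    """Return (shortest, longest) seconds the startupProbe allows for startup."""
    shortest = initial_delay
    longest = (initial_delay
               + failure_threshold * timeout
               + (failure_threshold - 1) * period)
    return shortest, longest


def min_time_to_ready(startup_shortest, initial_delay, period, success_threshold):
    """Shortest seconds before the container receives traffic (article's model)."""
    return startup_shortest + initial_delay + period * (success_threshold - 1)


def sigterm_budget(grace_period, pre_stop_sleep):
    """Seconds left for the application once the preStop hook has returned."""
    return grace_period - pre_stop_sleep


shortest, longest = startup_window(30, 30, 2, 10)
print(shortest, longest)                        # 30 320
print(min_time_to_ready(shortest, 30, 30, 2))   # 90
print(sigterm_budget(120, 60))                  # 60
```

With these values the application still has about 60 seconds to finish in-flight work after the preStop sleep ends, before the grace period expires and the container is killed.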

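Practice summary

Compared with the Kubernetes defaults, the practice configuration above makes the following improvements:

Adds a startup check that, tuned to the application's own characteristics, gives the application inside the container 30-320 seconds to start
Raises the minimum time before a container goes online from 0 to 90 seconds, which serves as a reasonable buffer during production releases
Lengthens failure detection to 66-96 seconds and recovery detection to 30-60 seconds, which makes the verdict more reliable and keeps an unstable new container from being judged ready and replacing a healthy old one
Raises the minimum container shutdown time from 0 to 60 seconds, giving unfinished requests more time to release their connections so that users' in-flight requests are not cut off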
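
The extra shutdown time only helps if the application reacts to SIGTERM once the preStop hook has finished. A minimal Python sketch of that application-side half is shown below; the `shutting_down` flag and the handler name are hypothetical, and a real service would use its framework's own shutdown hooks.

```python
import os
import signal
import threading

# Set once SIGTERM arrives; request loops can poll it to stop accepting new work.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Mark the process as draining instead of exiting immediately."""
    shutting_down.set()


# Must be installed from the main thread of the process.
signal.signal(signal.SIGTERM, handle_sigterm)

if __name__ == "__main__":
    # Simulate the signal Kubernetes sends by delivering SIGTERM to ourselves.
    os.kill(os.getpid(), signal.SIGTERM)
    shutting_down.wait(timeout=5)
    print("draining:", shutting_down.is_set())
```

Once the flag is set, the application can finish in-flight requests and exit on its own within the remaining grace period.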

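Further optimization

For web applications, have the application code assess its own business state and expose a /healthz health check page
Upgrade the tcpSocket-based health checks to httpGet, so that the probe result reflects what the health check page returns rather than just an open port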
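
As an illustration of such a /healthz page, here is a minimal sketch using only the Python standard library. The `app_state` dictionary, the handler class and the helper functions are hypothetical stand-ins for the application's real health logic, and the sketch binds to 127.0.0.1 on a free port for local experimentation; inside a pod the server would listen on the container port (8080 in this article).

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would check its own dependencies.
app_state = {"healthy": True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # 200 reports a healthy app; 503 makes an httpGet probe fail.
            status = 200 if app_state["healthy"] else 503
            body = b"ok" if status == 200 else b"unhealthy"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the log in this sketch


def start_server(port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def get_status(port, path="/healthz"):
    """Fetch a path the way a probe would and return the HTTP status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    conn.request("GET", path)
    status = conn.getresponse().status
    conn.close()
    return status


if __name__ == "__main__":
    server = start_server()
    print(get_status(server.server_port))
    server.shutdown()
```

Flipping `app_state["healthy"]` to False makes the endpoint answer 503, which an httpGet probe counts as a failure.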

Optimized configuration

An example Kubernetes Deployment using the /healthz health check page:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Default: 30
      terminationGracePeriodSeconds: 120
      imagePullSecrets:
        - name: mysecret
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          startupProbe:
            tcpSocket:
              port: 8080
            # Default: 0
            initialDelaySeconds: 30
            # Default: 10
            periodSeconds: 30
            # Default: 3
            failureThreshold: 10
            # Default: 1; must be 1 for startup probes
            successThreshold: 1
            # Default: 1
            timeoutSeconds: 2
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            # Default: 0
            initialDelaySeconds: 30
            # Default: 10
            periodSeconds: 30
            # Default: 3
            failureThreshold: 3
            # Default: 1; must be 1 for liveness probes
            successThreshold: 1
            # Default: 1
            timeoutSeconds: 2
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            # Default: 0
            initialDelaySeconds: 30
            # Default: 10
            periodSeconds: 30
            # Default: 3
            failureThreshold: 3
            # Default: 1
            successThreshold: 2
            # Default: 1
            timeoutSeconds: 2
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 60"]
          env:
            - name: TZ
              value: Asia/Shanghai
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 500m
              memory: 1Gi
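
To see how the readinessProbe thresholds in this manifest react to /healthz responses, the following Python sketch replays a sequence of status codes. It is a simplified model with hypothetical function names: it assumes an httpGet probe succeeds for any status from 200 up to but not including 400, and that the Ready state flips only after successThreshold consecutive successes or failureThreshold consecutive failures.

```python
def http_probe_succeeded(status_code):
    """httpGet probes treat any status in [200, 400) as success."""
    return 200 <= status_code < 400


def readiness_states(status_codes, success_threshold=2, failure_threshold=3):
    """Replay probe responses and return the Ready state after each one."""
    ready = False          # a container starts out not ready
    streak_result = None   # outcome shared by the current run of results
    streak_len = 0
    states = []
    for code in status_codes:
        ok = http_probe_succeeded(code)
        if ok == streak_result:
            streak_len += 1
        else:
            streak_result, streak_len = ok, 1
        if ok and streak_len >= success_threshold:
            ready = True
        elif not ok and streak_len >= failure_threshold:
            ready = False
        states.append(ready)
    return states


codes = [200, 200, 503, 503, 503, 200, 200]
print(readiness_states(codes))
# [False, True, True, True, False, False, True]
```

With periodSeconds set to 30, each list entry is one probe, so the container in this replay becomes ready on the second success and is only taken out of service after the third consecutive 503.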

References

https://kubernetes.io/zh-cn/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
