健康检查端点 /healthz 设计实践

目录
设计一个具有关键检查和独立外部监控的健康检查端点。
技巧
“Talk is cheap. Show me the code.” - Linus Torvalds
代码分享: https://github.com/mcsrainbow/mock-healthz-metrics
✨ 特性
/healthz
:- 返回码: 健康
200
, 异常500
- 输出:
纯文本
,JSON
- 返回码: 健康
/metrics
: Prometheus 指标- 两级健康检查:
- 🔴 关键检查 影响整体健康状态:
- 🔌 数据库连接: 核心依赖,必须健康
- ⚙️ 配置服务: 核心依赖,必须健康
- 🔁 内部 API
billing
,usage
: 依赖上游服务DB + Config
,上游失败时跳过
- 🟡 外部检查 独立:
- 🌍 外部 API
alipay
,sms
: 独立运行,不影响整体健康状态
- 🌍 外部 API
- 🔴 关键检查 影响整体健康状态:
- 依赖项具有内置错误概率和超时模拟
- 完全兼容 Kubernetes 探针和 Prometheus 抓取
- 依赖链:
- 数据库 + 配置服务 → 内部 API → 整体健康状态
- 外部 API → 仅独立监控
🚀 开始使用
1. 安装依赖
pip install bottle
2. 运行脚本
python mock-healthz-metrics.py
Service endpoints available:
Healthcheck (Text): http://0.0.0.0:8080/healthz
Healthcheck (JSON): http://0.0.0.0:8080/healthz?format=json
Prometheus metrics: http://0.0.0.0:8080/metrics
✅ 健康检查
纯文本
http://127.0.0.1:8080/healthz
http://127.0.0.1:8080/healthz?format=text
健康状态
CHECK STATUS MESSAGE
----- Critical -----
db_connection PASS Database is connected
config_service PASS Config service is reachable
internal_api/billing PASS internal_api/billing OK (392ms)
internal_api/usage PASS internal_api/usage OK (348ms)
----- External -----
external_api/alipay PASS external_api/alipay OK (308ms)
external_api/sms FAIL external_api/sms timed out
异常状态
CHECK STATUS MESSAGE
----- Critical -----
db_connection PASS Database is connected
config_service PASS Config service is reachable
internal_api/billing PASS internal_api/billing OK (253ms)
internal_api/usage FAIL internal_api/usage returned error
----- External -----
external_api/alipay PASS external_api/alipay OK (101ms)
external_api/sms PASS external_api/sms OK (183ms)
JSON 格式
http://127.0.0.1:8080/healthz?format=json
健康状态
{
"status": "ok",
"data": {
"message": "All critical checks passed",
"checks": {
"critical": [
{
"name": "db_connection",
"status": "ok",
"message": "Database is connected"
},
{
"name": "config_service",
"status": "ok",
"message": "Config service is reachable"
},
{
"name": "internal_api/billing",
"status": "ok",
"message": "internal_api/billing OK (392ms)"
},
{
"name": "internal_api/usage",
"status": "ok",
"message": "internal_api/usage OK (348ms)"
}
],
"external": [
{
"name": "external_api/alipay",
"status": "ok",
"message": "external_api/alipay OK (308ms)"
},
{
"name": "external_api/sms",
"status": "error",
"message": "external_api/sms timed out"
}
]
}
}
}
异常状态
{
"status": "error",
"data": {
"message": "Critical checks failed",
"checks": {
"critical": [
{
"name": "db_connection",
"status": "ok",
"message": "Database is connected"
},
{
"name": "config_service",
"status": "ok",
"message": "Config service is reachable"
},
{
"name": "internal_api/billing",
"status": "ok",
"message": "internal_api/billing OK (253ms)"
},
{
"name": "internal_api/usage",
"status": "error",
"message": "internal_api/usage returned error"
}
],
"external": [
{
"name": "external_api/alipay",
"status": "ok",
"message": "external_api/alipay OK (101ms)"
},
{
"name": "external_api/sms",
"status": "ok",
"message": "external_api/sms OK (183ms)"
}
]
}
}
}
📈 Prometheus
http://127.0.0.1:8080/metrics
配置
scrape_configs:
- job_name: 'mock-healthz-metrics'
static_configs:
- targets: ['127.0.0.1:8080']
默认抓取 http://127.0.0.1:8080/metrics
链接。
指标
健康状态
# HELP healthcheck_status Health check status (1=ok,0=error)
# TYPE healthcheck_status gauge
healthcheck_status{check="db_connection",type="critical"} 1
healthcheck_status{check="config_service",type="critical"} 1
healthcheck_status{check="internal_api/billing",type="critical"} 1
healthcheck_status{check="internal_api/usage",type="critical"} 1
healthcheck_status{check="external_api/alipay",type="external"} 1
healthcheck_status{check="external_api/sms",type="external"} 0
异常状态
# HELP healthcheck_status Health check status (1=ok,0=error)
# TYPE healthcheck_status gauge
healthcheck_status{check="db_connection",type="critical"} 1
healthcheck_status{check="config_service",type="critical"} 1
healthcheck_status{check="internal_api/billing",type="critical"} 0
healthcheck_status{check="internal_api/usage",type="critical"} 1
healthcheck_status{check="external_api/alipay",type="external"} 1
healthcheck_status{check="external_api/sms",type="external"} 1
🐳 Kubernetes
对于 Kubernetes livenessProbe 和 readinessProbe:
❯ curl -I http://127.0.0.1:8080/healthz
HTTP/1.0 200 OK
Date: Fri, 18 Jul 2025 03:43:31 GMT
Server: WSGIServer/0.2 CPython/3.11.11
Content-Type: text/plain
Content-Length: 444
❯ curl -I http://127.0.0.1:8080/healthz
HTTP/1.0 500 Internal Server Error
Date: Fri, 18 Jul 2025 03:43:36 GMT
Server: WSGIServer/0.2 CPython/3.11.11
Content-Type: text/plain
Content-Length: 438
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 30
failureThreshold: 5
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 20
failureThreshold: 3
successThreshold: 2
timeoutSeconds: 2
关于更详细的探针设计,请参考文章: Kubernetes 容器健康检查和优雅终止。
📌 注意事项
- 此脚本使用随机化逻辑来模拟服务不稳定性
- 对于生产环境使用,请用真实的健康检查实现替换模拟逻辑(例如,数据库 ping、HTTP 客户端调用等)