Health Checks

Grey Matter supports the configuration of active health checks on an upstream cluster. For reference on Grey Matter's health checking, see the usage docs.

This guide walks through how to set up HTTP health checking on an upstream cluster and how to enable the health check filter on a sidecar listener.

Prerequisites

  1. An existing Grey Matter deployment running on Kubernetes (tutorial)

  2. kubectl or oc set up with access to the cluster

  3. greymatter CLI set up with access to the deployment

Overview

Depending on your service and how you want to configure health checks, the process takes one or two steps.

First, you will enable health checks on a cluster. The options for this configuration can be found in the cluster health check docs. Setting this field causes the sidecar whose domain the cluster is set on to send periodic health check requests to the instances the cluster points at. The path value in health_checker.http_health_check determines the endpoint the health check requests are sent to; configure it to an endpoint in the service that returns a 200 response as long as the service is healthy.

This can be the only step in the process. However, if the cluster points at another sidecar in the mesh, you can also configure the health check filter on the receiving sidecar to tell it how to handle health check requests. The filter gives you options for how to detect and respond to health checks, and it must be set in order to simulate a failure.

The following steps show the process of setting up a cluster health check, verifying its usage, setting up the health check filter on the upstream sidecar, and simulating a health check failure.

Steps

1. Enable cluster health check

For this guide we will enable health checking from the Edge sidecar to the Grey Matter SLO service. To follow this guide on a different service, replace the mesh objects and pods with the corresponding service.
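If you are following along with a different service and are unsure of its cluster key, you can list the mesh's cluster objects first. This is a sketch; it assumes your version of the greymatter CLI supports a list command:

greymatter list cluster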

First we'll enable the health check on the upstream cluster:

greymatter edit cluster edge-to-slo-cluster

and add the following health_checks array:

"health_checks": [
{
"timeout_msec": 2000,
"interval_msec": 10000,
"unhealthy_threshold": 6,
"healthy_threshold": 1,
"health_checker": {
"http_health_check": {
"path": "/objectives"
}
}
}
]

This configures the Edge sidecar to check the health of the SLO sidecar at the path /objectives every 10 seconds, with a timeout of 2 seconds. After 6 unhealthy responses in a row, the SLO upstream cluster is marked unhealthy; with these values, a continuously failing upstream is marked unhealthy after roughly 60 seconds (6 checks at 10-second intervals).

We use /objectives here because the SLO service has an endpoint at that path that will return a 200 as long as the service is healthy.
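You can verify this behavior by curling the endpoint directly from inside the SLO pod. This is a sketch; it assumes the SLO service listens locally on port 1337, the upstream address shown in the sidecar logs later in this guide:

kubectl exec -it <slo-pod-name> -c sidecar -- curl -i http://127.0.0.1:1337/objectives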

2. Verify health check on cluster

Run kubectl get pods -l greymatter=edge to get the name of the Edge pod, then query the Edge sidecar's admin interface for health check stats:

kubectl exec -it <edge-pod-name> -- curl localhost:8001/stats | grep health_check

You will now see a block of health check stats for the SLO cluster:

...
cluster.slo.health_check.attempt: 22
cluster.slo.health_check.degraded: 0
cluster.slo.health_check.failure: 0
cluster.slo.health_check.healthy: 1
cluster.slo.health_check.network_failure: 0
cluster.slo.health_check.passive_failure: 0
cluster.slo.health_check.success: 22
cluster.slo.health_check.verify_cluster: 0

Here you can see the health check is passing, since healthy is 1. The success and failure values are counters, so these stats reflect the total number of successful and failed health checks. The healthy value indicates the status of the health check at that moment, and is determined by the healthy_threshold and unhealthy_threshold values.

Next, check the logs of the SLO sidecar:

kubectl logs -l greymatter.io/control=slo -c sidecar

And you will see the call to /objectives at the configured interval:

"GET /objectives HTTP/1.1" 200 - 0 2 40 37 "-" "Envoy/HC" "c88fa062-345a-4e6c-9d7b-d22160d4eb30" "slo" "127.0.0.1:1337"

3. Health check filter

By default, when a health check is configured on a cluster as in step 1, the sidecar generates health check requests to the instances the cluster points at and determines health based on the response code. The receiving host treats these requests like any other incoming request at that path.

To tell the SLO sidecar how to recognize and handle incoming health checks, we need to configure the health check filter on the sidecar's ingress listener.

Options for this filter configuration are described in the filter reference docs.

greymatter edit listener listener-slo

Configure the filter by adding "envoy.health_check" to the active_http_filters list, and the configuration below to the http_filters map:

"envoy_health_check": {
"pass_through_mode": true,
"headers": [
{
"name": "user-agent",
"exact_match": "Envoy/HC"
}
]
}
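With pass_through_mode set to true, health check requests that match the headers are still forwarded to the service while it is healthy. If you would rather have the sidecar respond to health checks itself without forwarding them to the service, you can set pass_through_mode to false. The sketch below shows that variant; this behavior follows Envoy's health check filter semantics, so confirm the details against the filter reference docs:

"envoy_health_check": {
  "pass_through_mode": false,
  "headers": [
    {
      "name": "user-agent",
      "exact_match": "Envoy/HC"
    }
  ]
}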

The listener should now look like:

{
...
  "active_http_filters": [
    ...
    "envoy.health_check"
  ],
  "http_filters": {
    ...
    "envoy_health_check": {
      "pass_through_mode": true,
      "headers": [
        {
          "name": "user-agent",
          "exact_match": "Envoy/HC"
        }
      ]
    }
  },
  "listener_key": "listener-slo",
...
}
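After saving your changes, you can confirm the listener was updated. This is a sketch; it assumes your version of the greymatter CLI supports a get command:

greymatter get listener listener-slo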

4. Verify failure

Once this is applied, we can tell the SLO sidecar to fail its health checks.

Run kubectl get pods -l deployment=slo to get the name of the SLO pod. Then exec into the SLO pod:

kubectl exec -it <slo-pod-name> -c sidecar -- curl -XPOST localhost:8001/healthcheck/fail

This manually forces the sidecar's health checks to fail; the admin endpoint responds with OK. If you check the SLO sidecar logs again, you will now see:

"GET /objectives HTTP/1.1" 503 LH 0 0 1 - "-" "Envoy/HC" "f69b7b39-9c1a-43ba-8e33-228d0a722619" "slo" "-"

This log line shows a 503 response instead of a 200, and the request is no longer forwarded (the LH response flag indicates the request failed the local health check). The sidecar knows this request is a health check from the user-agent header value, Envoy/HC, that we configured in the filter. Other requests into the SLO service will not fail because of this configuration.
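To see that only health check requests are affected, send a request without the Envoy/HC user-agent through the sidecar's ingress listener. This is a sketch; it assumes the ingress listener is on port 10808, so adjust the port to match your deployment:

kubectl exec -it <slo-pod-name> -c sidecar -- curl -i http://localhost:10808/objectives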

Check the Edge sidecar's health check stats again:

kubectl exec -it <edge-pod-name> -- curl localhost:8001/stats | grep health_check

Now the stats show a failure count greater than 0. Once the number of consecutive failures reaches the unhealthy_threshold set on the cluster, the upstream cluster is marked unhealthy and the healthy value drops to 0:

cluster.slo.health_check.failure: 11
cluster.slo.health_check.healthy: 0

To turn off the health check failures, run the command below against the SLO sidecar:

kubectl exec -it <slo-pod-name> -c sidecar -- curl -XPOST localhost:8001/healthcheck/ok
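Because healthy_threshold is 1, a single successful check is enough to mark the upstream healthy again. Check the Edge sidecar's stats once more and you should see the healthy value return to 1:

kubectl exec -it <edge-pod-name> -- curl localhost:8001/stats | grep health_check

cluster.slo.health_check.healthy: 1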