This guide covers common errors you may experience in a greymatter.io mesh and ways to troubleshoot them.
Common Mesh Errors
A good place to start if you are seeing any of the below errors is to inspect the edge logs. This may allow you to diagnose the problem quickly.
Upstream connect error
upstream connect error or disconnect/reset before headers. reset reason: connection failure
This common error appears when a sidecar or service is not configured correctly. Upstream is the direction in which a sidecar expects to make connections to other sidecars or services: requests flow into a sidecar from downstream, and requests out to other sidecars or services flow upstream. To diagnose the problem, see inspecting the edge logs for an upstream connect error, or follow these steps in order:
- verify discovery from the edge sidecar to make sure that the sidecar you are trying to reach has been properly discovered
- verify discovery from the service sidecar to make sure that the service behind the sidecar is properly discovered
- verify mTLS to see if the mTLS configuration may be causing errors
If you are still experiencing errors after the above steps, run curl localhost:8001/config_dump from within the sidecar. This returns the full configuration of the sidecar, which you can compare against the mesh configuration you expect to have been applied.
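To make the dump easier to compare, you can filter it down to the sections you care about. A minimal sketch, assuming jq is available where you run the command and that the sidecar exposes its admin interface on port 8001 as used throughout this guide:
# List the top-level configuration sections in the dump
curl -s localhost:8001/config_dump | jq '.configs[]."@type"'
# Pull out only the cluster configuration for comparison
curl -s localhost:8001/config_dump | jq '.configs[] | select(."@type" | contains("ClustersConfigDump"))'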
No healthy upstream
no healthy upstream
This error commonly indicates a discovery issue, which usually means a sidecar or service is not configured correctly in the mesh. To diagnose the problem, see inspecting the edge logs for a no healthy upstream error, or follow these steps in order:
- verify discovery from the edge sidecar
- verify discovery from the service sidecar
503 or 502 errors on ingress
If the browser fails completely with a 502 or a 503 error, it’s likely an issue with JWT filter configuration or with SPIRE.
The best way to diagnose and troubleshoot these errors is to follow the steps in inspecting the edge logs for a 502 or 503 error.
Inspecting edge logs
When requests in the mesh behave unexpectedly, a good first step is to check the logs of the edge sidecar. This can give you an immediate indication of the problem.
Every request into the mesh passes through the edge sidecar, so errors from underlying services can bubble up to this level.
For Kubernetes:
kubectl logs <pod-name>
For Docker:
docker logs <container-id>
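If you are on Kubernetes, a minimal sketch for locating the edge pod and following its logs; the greymatter namespace and the edge naming are assumptions, so adjust to your environment:
# Locate the edge pod
kubectl get pods -n greymatter | grep edge
# Follow its logs (add -c <container> if the pod runs more than one container)
kubectl logs -n greymatter <edge-pod-name> --tail=200 -f
The sections below grep these logs for a specific request path, for example by piping the output through grep "/services/fibonacci".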
502 or 503 error edge logs
If you are seeing a direct 502 or 503 error in a web browser, the problem is likely with either ingress to the edge service, or a failure in the filter chain, such as the JWT filter.
Check the edge logs and grep for your path (e.g. /services/fibonacci). With this error, it is likely that you won’t see the request at all. If this is the case, check for logs indicating a filter chain failure, which would appear as something like:
[2020-11-18 15:21:48.534][23][error][filter] [:] [filters/gm-jwt-security.go:169] gm-jwt-security filter: DecodeHeaders() - fetchToken(): Max retries reached after 1 retries: Bad response code received from call to gm-jwt-security: 503
This could indicate one of several things: a problem with SPIRE, a misconfigured filter, or an mTLS configuration issue from the filter to the jwt-security sidecar.
If you also see this in the logs:
[2020-11-18 16:17:46.897][17][warning][config] [bazel-out/k8-fastbuild/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:93] StreamSecrets gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
then the issue is with SPIRE. Make sure the SPIRE server and agents are up and running, and then restart all pods.
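A minimal sketch of those checks on Kubernetes; the spire namespace, the greymatter namespace, and the use of a rollout restart are assumptions based on a default install, so adapt them to your environment:
# Confirm the SPIRE server and agents are running
kubectl get pods -n spire
# Restart the mesh workloads so their sidecars re-establish identities
kubectl rollout restart deployment -n greymatter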
Otherwise, try the following to troubleshoot:
1. Check that the jwt-security service and sidecar are up and running.
2. Follow troubleshooting SPIRE.
3. If everything appears to be working in steps 1 and 2, follow the steps to verify discovery and verify mTLS from edge to jwt-security and in the jwt-security sidecar.
No healthy upstream edge logs
If you are seeing a no healthy upstream error, check the edge logs and grep for your path (e.g. /services/fibonacci). You may see a log that looks like this:
[2020-11-18T14:54:40.081Z] "GET /services/fibonacci/ HTTP/1.1" 503 UH 0 19 9 - "10.42.4.8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "fec3410e-2c7e-40aa-b14f-63f8b3067957" "localhost:30000" "-"
You can see that this is a discovery issue because the upstream host is "-", which means it is not known. If this is the case, follow the more specific steps to verify discovery, and ultimately go back through the deploy a service guide to make sure all of the necessary mesh objects from steps 2 and 3 have been created.
Upstream connect edge logs
If you are seeing an upstream connect error, check the edge logs and grep for your path (e.g. /services/fibonacci). You may see a log that looks like this:
[2020-11-18T15:04:17.312Z] "GET /services/fibonacci/ HTTP/1.1" 503 UF,URX 0 91 93 - "10.42.4.8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "20865026-e4da-488c-98c6-42f67da7f067" "localhost:30000" "10.42.1.8:10808"
Because the upstream host is populated as 10.42.1.8:10808, we can see that this is not a discovery issue from edge to sidecar. To rule out discovery further along the chain as the problem, next check the service’s sidecar logs.
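A minimal sketch of doing that on Kubernetes; the namespace, pod name, and sidecar container name are placeholders, so adjust them to your deployment:
# Follow the service's sidecar logs and filter for the failing path
kubectl logs -n <namespace> <service-pod-name> -c sidecar --tail=200 -f | grep "/services/fibonacci"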
If you see a 503 log that looks like the following:
[2020-11-18T15:08:01.389Z] "GET /services/fibonacci/ HTTP/1.1" 503 UF,URX 0 91 39 - "10.42.4.8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "d5295242-93f5-47e2-afd1-089556229734" "localhost:30000" "{UPSTREAM-HOST}"
Inspect the {UPSTREAM-HOST} value. If the value is "-", there is a discovery issue from sidecar to service, and you will likely need to hard code the address in a proxy-local cluster (see the sketch below). If the value is something like [::1]:9080, the instance that you configured earlier while deploying the service (the local cluster) could be wrong, or the service may have other security requirements that don’t allow the sidecar to make requests to it as configured.
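For the discovery case above, the hard-coded address goes on the cluster object that points from the sidecar to the local service. A rough sketch of such an object is below; the keys and values shown (zone key, cluster key, name, host, and port) are placeholders, and the exact schema depends on your greymatter version, so compare against a working cluster object in your own mesh:
{
  "zone_key": "<your-zone>",
  "cluster_key": "<service-name>-local",
  "name": "<service-name>",
  "instances": [
    {
      "host": "127.0.0.1",
      "port": 9080
    }
  ]
}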
If you don’t see any indication of the request in the service’s sidecar logs at all, the problem is likely mTLS from edge to sidecar. Follow the steps to verify mTLS.
Diagnose
Verify Discovery
To check for service discovery using the admin interface, exec into the sidecar container using whichever method is specific to your environment and run curl localhost:8001/clusters.
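For example, a minimal sketch on Kubernetes; the namespace, pod name, and sidecar container name are placeholders:
# List the clusters known to this sidecar
kubectl exec -n greymatter <pod-name> -c sidecar -- curl -s localhost:8001/clusters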
In the output, find the name of the service you are trying to reach. If the name of the service is there, and it is followed by an IP address, the service has been discovered. For the dashboard service as an example:
dashboard::10.42.2.8:10808::cx_active::7
dashboard::10.42.2.8:10808::cx_connect_fail::0
dashboard::10.42.2.8:10808::cx_total::9
dashboard::10.42.2.8:10808::rq_active::0
If the name of the service you are trying to reach is not there, or if it is listed but has no entries containing an IP address, the service is likely misconfigured in the mesh. The former could mean that the cluster for this service does not exist; the latter, that the service is not properly linked to this sidecar via route or shared rules objects. Go back through the deploy a service guide and make sure all of the necessary mesh objects from steps 2 and 3 have been created.
Verify mTLS configuration
mTLS configuration issues can cause errors at any point along the chain of a request through the mesh. If you are sure that your mesh objects have been created and discovery is happening as expected, try the following to check whether your issue is with mTLS.
Check the edge sidecar logs for a request made to your sidecar endpoint. For example, if you have the fibonacci service at path /services/fibonacci, you might see the following in the edge logs:
[2020-11-17T21:31:53.995Z] "GET /services/fibonacci/ HTTP/1.1" 503 UF,URX 0 91 89 - "10.42.4.5" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "f1db7d40-6a6d-43fd-aef2-9118941e2b0d" "localhost:30000" "10.42.3.12:10808"
This request returned a 503 at the edge; if you don’t see any indication of it in the sidecar’s logs, this is likely an mTLS problem.
First, exec into the sidecar and try hitting the stats endpoint: curl localhost:8001/stats | grep ssl. You will see something like the following:
listener.0.0.0.0_10808.server_ssl_socket_factory.ssl_context_update_by_sds: 4
listener.0.0.0.0_10808.server_ssl_socket_factory.upstream_context_secrets_not_ready: 0
listener.0.0.0.0_10808.ssl.ciphers.ECDHE-ECDSA-AES128-GCM-SHA256: 1
listener.0.0.0.0_10808.ssl.connection_error: 0
listener.0.0.0.0_10808.ssl.curves.X25519: 1
listener.0.0.0.0_10808.ssl.fail_verify_cert_hash: 0
listener.0.0.0.0_10808.ssl.fail_verify_error: 0
listener.0.0.0.0_10808.ssl.fail_verify_no_cert: 0
listener.0.0.0.0_10808.ssl.fail_verify_san: 3
listener.0.0.0.0_10808.ssl.handshake: 1
listener.0.0.0.0_10808.ssl.no_certificate: 0
listener.0.0.0.0_10808.ssl.session_reused: 0
listener.0.0.0.0_10808.ssl.sigalgs.ecdsa_secp256r1_sha256: 1
listener.0.0.0.0_10808.ssl.versions.TLSv1.2: 1
These stats reflect SSL activity on the sidecar’s ingress listener, so you should be able to see whether any SSL errors are occurring (for example, the non-zero ssl.fail_verify_san counter above).
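To surface just the failure counters, a quick sketch:
# Show only the SSL failure and error counters
curl -s localhost:8001/stats | grep ssl | grep -E "fail|error"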
To see the errors in the logs, you can run curl localhost:8001/logging?level=debug -XPOST, then make the failing request again. Because of the high volume of debug logs, it may be best to run curl localhost:8001/logging?level=info -XPOST immediately afterwards to turn off debug logging while you inspect the output.
Grep the logs for TLS error or OPENSSL. The errors that come up here may also indicate more specifically what the problem is.
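For example, on Kubernetes (the namespace and pod name are placeholders):
# Search the sidecar's debug output for TLS/OpenSSL errors
kubectl logs -n greymatter <pod-name> | grep -Ei "tls error|openssl"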
If any of the above indicate mTLS failures, make note of the specific failures you are seeing. If you have a SPIRE enabled deployment, follow the troubleshooting SPIRE guide. If you are not running a SPIRE enabled deployment, check the SSL configuration on the domain object for your sidecar and on the cluster from edge to your sidecar.
GET /logging
Returns the current log level.
current log level = info
POST /logging?level=<log-level>
Updates the log level to the level indicated in the query parameter level. Level should be one of: error, warn, info, debug.
Missing Metrics
If metrics are missing from the greymatter web UI, inspect the configuration of the Prometheus instance that was launched by the operator.
View the current configuration with kubectl:
# The default configmap name is "prometheus"
kubectl describe configmap -n greymatter prometheus
Under the scrape_configs stanza, find the kubernetes_sd_configs and its list of Kubernetes namespaces.
# ...elided...
# Example demonstrating several tenants (team-a, team-b, etc.)
kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [greymatter,default,examples,team-a,team-b,team-c,team-d]
To collect metrics, Prometheus must be configured to monitor every namespace where greymatter sidecars are running.
Provide this configuration via the greymatter operator by updating inputs.cue, then restart Prometheus.
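As a sketch, once inputs.cue has been updated and applied through the operator, Prometheus can be restarted so it reloads the new scrape configuration; the StatefulSet name and namespace here are assumptions based on the default install:
# Restart Prometheus so it picks up the updated scrape_configs
kubectl rollout restart statefulset prometheus -n greymatter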