Troubleshooting
This guide covers common errors you may experience in a Grey Matter Fabric mesh and ways to troubleshoot them.
Throughout this guide you will see calls to curl localhost:8001, the admin interface exposed by every sidecar. To see a list of all available resources, run curl localhost:8001/help.
For a sidecar in Kubernetes:
kubectl exec -it <pod-name> -c sidecar -- curl localhost:8001/help
For the edge sidecar in Kubernetes:
kubectl exec -it <pod-name> -- curl localhost:8001/help
For Docker:
docker exec -it <container-id> curl localhost:8001/help
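The three commands above differ only in how they reach the sidecar. As a convenience, they can be wrapped in a small helper. This is a sketch, not part of Grey Matter itself; it only prints the command it would run, so you can verify the target before executing anything. The `sidecar` container name follows this guide, and the pod or container name is whatever you are targeting.

```shell
# Sketch: wrap the three admin-interface access patterns above in one helper.
# It PRINTS the command rather than running it, so you can check the target.
sidecar_admin() {
  mode="$1"; target="$2"; path="${3:-/help}"
  case "$mode" in
    k8s)    echo kubectl exec -it "$target" -c sidecar -- curl "localhost:8001$path" ;;
    edge)   echo kubectl exec -it "$target" -- curl "localhost:8001$path" ;;
    docker) echo docker exec -it "$target" curl "localhost:8001$path" ;;
    *)      echo "unknown environment: $mode" >&2; return 1 ;;
  esac
}
```

For example, `sidecar_admin k8s my-pod /clusters` prints the kubectl command you would then run by hand.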

Common Fabric Errors

A good place to start if you are seeing any of the below errors is to inspect the edge logs. This may allow you to diagnose the problem quickly.

Upstream connect error

upstream connect error or disconnect/reset before headers. reset reason: connection failure
This common error appears when a sidecar or service is not configured correctly. Upstream is the direction in which a sidecar makes connections to other sidecars or services: requests flow into a sidecar downstream, and requests out to other sidecars or services flow upstream. To diagnose the problem, inspect the edge logs for an upstream connect error or follow these steps in order:
1. verify discovery from the edge sidecar to make sure that the sidecar you are trying to reach has been properly discovered
2. verify discovery from the service sidecar to make sure that the service behind the sidecar is properly discovered
3. verify mTLS to see if the mTLS configuration may be causing errors
If you are still experiencing errors after the above steps, run curl localhost:8001/config_dump from within the sidecar. This returns the sidecar's full configuration, which you can compare against the mesh configuration you expect to have been applied.
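To make that comparison easier, you can pull just the resource names out of a saved dump. A minimal sketch, run here against an inline sample instead of a live sidecar; on a real sidecar you would first save the output with curl -s localhost:8001/config_dump > dump.json. The names below are examples only.

```shell
# Sketch: extract resource names from a (saved) config_dump so they can be
# compared against the mesh objects you expect. The inline sample stands in
# for real output; "dashboard" and "edge" are just example names.
dump='{"configs":[{"cluster":{"name":"dashboard"}},{"cluster":{"name":"edge"}}]}'
printf '%s\n' "$dump" | grep -o '"name":"[^"]*"' | sed 's/"name":"\(.*\)"/\1/'
```

Any cluster, route, or listener you expected but don't see in this list points at a mesh object that was never applied.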

No healthy upstream

no healthy upstream
This error commonly indicates a discovery issue, which usually means a sidecar or service is not configured correctly in the mesh. To diagnose the problem, inspect the edge logs for a no healthy upstream error or follow these steps in order:
1. verify discovery from the edge sidecar
2. verify discovery from the service sidecar

503 or 502 errors on ingress

If the browser fails completely with a 502 or a 503 error, it's likely an issue with JWT filter configuration or with SPIRE.
The best way to diagnose and troubleshoot these errors is to follow the steps in inspecting the edge logs for a 502 or 503 error.

Inspecting edge logs

A good first step when requests in the mesh behave unexpectedly is to check the logs of the edge sidecar; this can give you an immediate indication of the problem.
Every request into the mesh passes through the edge sidecar, so errors from underlying services bubble up to this level.
For Kubernetes:
kubectl logs <pod-name>
For Docker:
docker logs <container-id>

502 or 503 error edge logs

If you are seeing a direct 502 or 503 error in a web browser, the problem is likely with either ingress to the edge service or a failure in the filter chain, such as the JWT filter.
Check the edge logs and grep for your path (e.g. /services/fibonacci). With this error, you likely won't see the request at all. If that is the case, check for logs indicating a filter chain failure, which would appear as something like:
[2020-11-18 15:21:48.534][23][error][filter] [:] [filters/gm-jwt-security.go:169] gm-jwt-security filter: DecodeHeaders() - fetchToken(): Max retries reached after 1 retries: Bad response code received from call to gm-jwt-security: 503
This could indicate one of several things: a problem with SPIRE, a misconfigured filter, or an mTLS configuration issue between the filter and the jwt-security sidecar.
If you also see this in the logs:
[2020-11-18 16:17:46.897][17][warning][config] [bazel-out/k8-fastbuild/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:93] StreamSecrets gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
The issue is with SPIRE. Make sure the SPIRE server and agents are up and running, and then restart all pods.
Otherwise, try the following to troubleshoot:
1. Check that the jwt-security service and sidecar are up and running.
2. If everything appears to be working in step 1, follow the steps to verify discovery and verify mTLS from edge to jwt-security and in the jwt-security sidecar.

No healthy upstream edge logs

If you are seeing a no healthy upstream error, check the edge logs and grep for your path (e.g. /services/fibonacci). You may see a log that looks like this:
[2020-11-18T14:54:40.081Z] "GET /services/fibonacci/ HTTP/1.1" 503 UH 0 19 9 - "10.42.4.8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "fec3410e-2c7e-40aa-b14f-63f8b3067957" "localhost:30000" "-"
You can see that this is a discovery issue because the upstream host is "-", which means it is not known. If this is the case, you can follow more specific steps to verify discovery and ultimately go back through the deploy a service guide and make sure to create all of the necessary mesh objects from steps 2 and 3.
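Reading this log format by eye takes practice; the fields that matter most here are the status code, the response flags (UH above), and the final quoted upstream host. A hedged sketch of picking them out with awk, using an abbreviated copy of the line above (user agent shortened for readability) as input:

```shell
# Sketch: pull the status code, response flags, and upstream host out of an
# access log line in the default format shown in this guide.
line='[2020-11-18T14:54:40.081Z] "GET /services/fibonacci/ HTTP/1.1" 503 UH 0 19 9 - "10.42.4.8" "Mozilla/5.0" "fec3410e-2c7e-40aa-b14f-63f8b3067957" "localhost:30000" "-"'
status=$(printf '%s\n' "$line" | awk '{print $5}')   # HTTP status code
flags=$(printf '%s\n' "$line" | awk '{print $6}')    # response flags
upstream=$(printf '%s\n' "$line" | awk '{print $NF}' | tr -d '"')  # upstream host
echo "status=$status flags=$flags upstream=$upstream"
```

Here the flags come out as UH and the upstream host as -, confirming the discovery issue described above.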

Upstream connect edge logs

If you are seeing an upstream connect error, check the edge logs and grep for your path (e.g. /services/fibonacci). You may see a log that looks like this:
[2020-11-18T15:04:17.312Z] "GET /services/fibonacci/ HTTP/1.1" 503 UF,URX 0 91 93 - "10.42.4.8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "20865026-e4da-488c-98c6-42f67da7f067" "localhost:30000" "10.42.1.8:10808"
Because the upstream host is populated as 10.42.1.8:10808, we can see that this is not a discovery issue from edge to sidecar. To rule out discovery as the problem, next check the service's sidecar logs.
If you see a 503 log that looks like the following:
[2020-11-18T15:08:01.389Z] "GET /services/fibonacci/ HTTP/1.1" 503 UF,URX 0 91 39 - "10.42.4.8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "d5295242-93f5-47e2-afd1-089556229734" "localhost:30000" "{UPSTREAM-HOST}"
Inspect the {UPSTREAM-HOST} value. If it is "-", there is a discovery issue from sidecar to service, and you will likely need to hard-code the address in the local cluster. If the value is something like [::1]:9080, the instance that you configured in this step of deploying a service (the local cluster) could be wrong, or the service may have other security requirements that prevent the sidecar from making requests to it as configured.
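The response flags in these logs distinguish the failure modes described in this guide: UH is emitted when no healthy upstream host exists (a discovery problem), while UF (upstream connection failure), often paired with URX (retries exhausted), points at a connection or mTLS problem. A minimal sketch of that mapping, covering only the flags discussed in this guide:

```shell
# Sketch: map the response flags seen in this guide to a likely cause.
# Only the flags discussed above are handled; anything else falls through.
explain_flags() {
  case "$1" in
    *UH*)       echo "no healthy upstream: likely a discovery issue" ;;
    *UF*|*URX*) echo "upstream connection failure: check discovery, then mTLS" ;;
    *)          echo "see the Envoy response-flag documentation" ;;
  esac
}
explain_flags "UF,URX"
```

This is a triage aid, not a diagnosis; the verification steps below are still the authoritative checks.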
If you don't see any indication of the request there, the problem is likely mTLS from edge to sidecar. Follow the steps to verify mTLS.

Diagnose

Verify Discovery

To check for service discovery using the admin interface, exec into the sidecar container using whichever method is specific to your environment and run curl localhost:8001/clusters.
In the output, find the name of the service you are trying to reach. If the name of the service is there, and it is followed by an IP address, the service has been discovered. For the dashboard service as an example:
dashboard::10.42.2.8:10808::cx_active::7
dashboard::10.42.2.8:10808::cx_connect_fail::0
dashboard::10.42.2.8:10808::cx_total::9
dashboard::10.42.2.8:10808::rq_active::0
If the name of the service you are trying to reach is not there, or it is listed but with no section containing an IP address, the service is likely misconfigured in the mesh: the former could mean that the cluster object does not exist for this service, the latter that it is not properly linked to this sidecar via route or shared_rules objects. Go back through the deploy a service guide and make sure to create all of the necessary mesh objects from steps 2 and 3.
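This check can be scripted. A sketch, assuming the <name>::<ip>:<port>::<stat> layout of /clusters shown above, and run here against a pasted sample rather than a live sidecar; on a live sidecar you would pipe in `curl -s localhost:8001/clusters` instead.

```shell
# Sketch: given /clusters output on stdin, report whether a service has been
# discovered (its name followed by an IP:port, per the layout above).
discovered() {
  grep -Eq "^$1::[0-9.]+:[0-9]+::"
}
clusters='dashboard::10.42.2.8:10808::cx_active::7
dashboard::10.42.2.8:10808::cx_total::9'
printf '%s\n' "$clusters" | discovered dashboard && echo "dashboard: discovered"
printf '%s\n' "$clusters" | discovered fibonacci || echo "fibonacci: not discovered"
```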

Verify mTLS configuration

mTLS configuration issues can cause errors at any point along a request's path through the mesh. If you are sure that your mesh objects have been created and discovery is happening as expected, try the following to check whether your issue is with mTLS.
If you check the edge sidecar logs and see a request made to your sidecar endpoint, the failure may be happening at the mTLS handshake. For example, if you have the fibonacci service at path /services/fibonacci, you might see the following in the edge logs:
[2020-11-17T21:31:53.995Z] "GET /services/fibonacci/ HTTP/1.1" 503 UF,URX 0 91 89 - "10.42.4.5" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36" "f1db7d40-6a6d-43fd-aef2-9118941e2b0d" "localhost:30000" "10.42.3.12:10808"
If a request like this returns a 503, but you see no indication of it in the sidecar's logs, this is likely an mTLS problem.
First, exec into the sidecar and try hitting the stats endpoint: curl localhost:8001/stats | grep ssl. You will see something like the following:
listener.0.0.0.0_10808.server_ssl_socket_factory.ssl_context_update_by_sds: 4
listener.0.0.0.0_10808.server_ssl_socket_factory.upstream_context_secrets_not_ready: 0
listener.0.0.0.0_10808.ssl.ciphers.ECDHE-ECDSA-AES128-GCM-SHA256: 1
listener.0.0.0.0_10808.ssl.connection_error: 0
listener.0.0.0.0_10808.ssl.curves.X25519: 1
listener.0.0.0.0_10808.ssl.fail_verify_cert_hash: 0
listener.0.0.0.0_10808.ssl.fail_verify_error: 0
listener.0.0.0.0_10808.ssl.fail_verify_no_cert: 0
listener.0.0.0.0_10808.ssl.fail_verify_san: 3
listener.0.0.0.0_10808.ssl.handshake: 1
listener.0.0.0.0_10808.ssl.no_certificate: 0
listener.0.0.0.0_10808.ssl.session_reused: 0
listener.0.0.0.0_10808.ssl.sigalgs.ecdsa_secp256r1_sha256: 1
listener.0.0.0.0_10808.ssl.versions.TLSv1.2: 1
These stats describe SSL activity on the sidecar's ingress listener, and the fail and error counters show whether SSL errors are occurring.
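To spot failures quickly, the ssl.fail_verify_* counters can be summed; anything non-zero means certificate verification is failing on that listener. A sketch run against a pasted snapshot of the stats above; on a live sidecar you would pipe in `curl -s localhost:8001/stats | grep ssl` instead.

```shell
# Sketch: total the certificate-verification failure counters from a saved
# /stats snapshot. Non-zero output means verification failures.
stats='listener.0.0.0.0_10808.ssl.fail_verify_cert_hash: 0
listener.0.0.0.0_10808.ssl.fail_verify_error: 0
listener.0.0.0.0_10808.ssl.fail_verify_no_cert: 0
listener.0.0.0.0_10808.ssl.fail_verify_san: 3'
printf '%s\n' "$stats" | awk -F': ' '/fail_verify/ {sum += $2} END {print sum}'
```

With the sample above this prints 3, all of it from fail_verify_san, i.e. the peer certificate's SAN did not match what the listener expected.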
To see the errors in the logs, you can run curl localhost:8001/logging?level=debug -XPOST. Then, make the failing request again. Because of the high volume of logs, it may be best to run curl localhost:8001/logging?level=info -XPOST immediately afterwards to turn off debug logging while you inspect the output.
Grep the logs for TLS error or OPENSSL. The errors that come up here may also indicate more specifically what the problem is.
If any of the above indicate mTLS failures, make note of the specific failures you are seeing. If you have a SPIRE enabled deployment, follow the troubleshooting SPIRE guide. If you are not running a SPIRE enabled deployment, check the SSL configuration on the domain object for your sidecar and on the cluster from edge to your sidecar.

Service Logging

The log level of any core service can be retrieved and changed dynamically via the following requests:

GET /logging

Returns the current log level.
current log level = info

PUT /logging?level=<log-level>

Updates the log level to the level indicated in query parameter level.
Level should be one of: error, warn, info, debug.
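When scripting against this endpoint it is easy to send a level the service will reject. A sketch that validates the level first; the base URL is caller-supplied because this guide does not specify one, and the helper prints the curl command rather than running it, since it needs a live service.

```shell
# Sketch: set a core service's log level via PUT /logging, validating the
# level first. $1 is the service's base URL (not specified in this guide),
# $2 the level. Prints the curl command instead of executing it.
set_log_level() {
  case "$2" in
    error|warn|info|debug) ;;
    *) echo "invalid level: $2 (use error|warn|info|debug)" >&2; return 1 ;;
  esac
  echo curl -s -X PUT "$1/logging?level=$2"
}
```

Note the quotes around the URL: an unquoted ? can be expanded by the shell, which is a common source of confusing failures with these endpoints.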

Other Issues

If you are still running into issues and need assistance, please contact us at Grey Matter Support.