The Grey Matter Metrics Filter sets up a local metrics server to gather and report real-time statistics for the sidecar, microservice, and host system.
For the sidecar as a whole, the following stats will be computed and reported:

- metrics version
- total requests
- total HTTP
- total HTTPS
- total RPC
- total RPC/TLS
- total requests
- total 200
- total 2xx
- latency (avg)
- latency (count)
- latency max
- latency min
- latency sum
- latency p50
- latency p90
- latency p95
- latency p99
- latency p9990
- latency p9999
- number of errors
- incoming throughput
- outgoing throughput
For each route that is addressed, the following stats will be computed and reported:

- total requests
- total 200
- total 2xx
- latency (avg)
- latency (count)
- latency max
- latency min
- latency sum
- latency p50
- latency p90
- latency p95
- latency p99
- latency p9990
- latency p9999
- number of errors
- incoming throughput
- outgoing throughput
The following system statistics are also computed and reported:

- number of goroutines
- start time
- CPU percent used
- CPU cores on system
- OS
- OS architecture
- memory available
- memory used
- memory used %
- process memory used
Optionally, this filter can serve the computed statistics in a form suitable for scraping by Prometheus. The Prometheus endpoint is hosted at `{METRICS_HOST}:{METRICS_PORT}{METRICS_PROMETHEUS_URI_PATH}` and can be scraped directly through the supported Prometheus service discovery mechanisms.
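For example, a minimal static scrape configuration for a Prometheus server might look like the sketch below. The port and path mirror the sample filter configuration later on this page, and `sidecar.example.com` is a placeholder for wherever the sidecar's metrics server is reachable:

```yaml
scrape_configs:
  - job_name: greymatter-sidecar
    # Path served by the metrics filter (metrics_prometheus_uri_path)
    metrics_path: /prometheus
    static_configs:
      # metrics_host:metrics_port of the sidecar's metrics server
      - targets: ["sidecar.example.com:9080"]
```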
The metrics filter can also push the compiled statistics directly to AWS CloudWatch. This allows Grey Matter Proxy metrics to drive actions such as Auto Scaling, or simply to provide tighter monitoring directly in AWS.
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `metrics_port` | Integer |  | Port the metrics server listens on |
| `metrics_host` | String |  | Host the metrics server listens on |
| `metrics_dashboard_uri_path` | String |  | The URI path at which the metrics dashboard (JSON) is served |
| `metrics_prometheus_uri_path` | String |  | The URI path at which the Prometheus endpoint is served |
|  | Integer |  |  |
| `metrics_ring_buffer_size` | Integer |  | Size of the cache of active metrics data |
| `metrics_key_function` | String | `""` | Function to provide internal rollup of URL paths when reporting metrics |
| `metrics_key_depth` | String | `"1"` | Truncate URLs to the first path section |
| `use_metrics_tls` | Boolean |  | If true, the metrics server uses TLS |
|  | String |  | SSL trust file to use when serving metrics over TLS |
|  | String |  | SSL certificate to use when serving metrics over TLS |
|  | String |  | SSL private key file to use when serving metrics over TLS |
| `enable_cloudwatch` | Boolean |  | If true, report metrics to AWS CloudWatch |
|  | Integer |  | Interval at which to send metrics to AWS CloudWatch |
|  | String |  | Namespace for CloudWatch metrics |
|  | String |  | Dimensions to report to CloudWatch |
|  | String |  | URI paths to send metrics for |
|  | String |  | Metrics keys to send metrics for |
|  | Boolean | false | Verbose debugging for the CloudWatch connection |
|  | String |  | AWS region for access |
|  | String |  | AWS access key |
|  | String |  | AWS secret access key |
|  | String |  | AWS session token |
|  | String |  | AWS profile to use for login |
|  | String |  | Location on disk of the AWS config file |
```yaml
http_filters:
- name: gm.metrics
  config:
    metrics_port: 9080
    metrics_host: 0.0.0.0
    metrics_dashboard_uri_path: "/metrics"
    metrics_prometheus_uri_path: "/prometheus"
    metrics_ring_buffer_size: 4096
    use_metrics_tls: false
    enable_cloudwatch: false
```
Example response from the `/metrics` (dashboard) endpoint:
{"grey-matter-metrics-version": "1.0.0","Total/requests": 22091,"HTTP/requests": 0,"HTTPS/requests": 22091,"RPC/requests": 0,"RPC_TLS/requests": 0,"route/services/catalog/1.0/summary/GET/requests": 3345,"route/services/catalog/1.0/summary/GET/routes": "","route/services/catalog/1.0/summary/GET/status/200": 3345,"route/services/catalog/1.0/summary/GET/status/2XX": 3345,"route/services/catalog/1.0/summary/GET/latency_ms.avg": 0.000000,"route/services/catalog/1.0/summary/GET/latency_ms.count": 7,"route/services/catalog/1.0/summary/GET/latency_ms.max": 0,"route/services/catalog/1.0/summary/GET/latency_ms.min": 0,"route/services/catalog/1.0/summary/GET/latency_ms.sum": 0,"route/services/catalog/1.0/summary/GET/latency_ms.p50": 0,"route/services/catalog/1.0/summary/GET/latency_ms.p90": 0,"route/services/catalog/1.0/summary/GET/latency_ms.p95": 0,"route/services/catalog/1.0/summary/GET/latency_ms.p99": 0,"route/services/catalog/1.0/summary/GET/latency_ms.p9990": 0,"route/services/catalog/1.0/summary/GET/latency_ms.p9999": 0,"route/services/catalog/1.0/summary/GET/errors.count": 0,"route/services/catalog/1.0/summary/GET/in_throughput": 0,"route/services/catalog/1.0/summary/GET/out_throughput": 25970425,"route/services/sense/1.0/recommendation/GET/requests": 3350,"route/services/sense/1.0/recommendation/GET/routes": "","route/services/sense/1.0/recommendation/GET/status/200": 3341,"route/services/sense/1.0/recommendation/GET/status/503": 9,"route/services/sense/1.0/recommendation/GET/status/2XX": 3341,"route/services/sense/1.0/recommendation/GET/status/5XX": 9,"route/services/sense/1.0/recommendation/GET/latency_ms.avg": 0.000000,"route/services/sense/1.0/recommendation/GET/latency_ms.count": 7,"route/services/sense/1.0/recommendation/GET/latency_ms.max": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.min": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.sum": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.p50": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.p90": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.p95": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.p99": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.p9990": 0,"route/services/sense/1.0/recommendation/GET/latency_ms.p9999": 0,"route/services/sense/1.0/recommendation/GET/errors.count": 0,"route/services/sense/1.0/recommendation/GET/in_throughput": 0,"route/services/sense/1.0/recommendation/GET/out_throughput": 1450994,"all/requests": 21924,"all/routes": "","all/status/304": 112,"all/status/200": 21803,"all/status/503": 9,"all/status/2XX": 21803,"all/status/5XX": 9,"all/status/3XX": 112,"all/latency_ms.avg": 0.013428,"all/latency_ms.count": 4096,"all/latency_ms.max": 13,"all/latency_ms.min": 0,"all/latency_ms.sum": 55,"all/latency_ms.p50": 0,"all/latency_ms.p90": 0,"all/latency_ms.p95": 0,"all/latency_ms.p99": 0,"all/latency_ms.p9990": 4,"all/latency_ms.p9999": 13,"all/errors.count": 0,"all/in_throughput": 132437,"all/out_throughput": 3622059,"route//GET/requests": 13,"route//GET/routes": "","route//GET/status/304": 12,"route//GET/status/200": 1,"route//GET/status/3XX": 12,"route//GET/status/2XX": 1,"route//GET/latency_ms.avg": 0.000000,"route//GET/latency_ms.count": 1,"route//GET/latency_ms.max": 0,"route//GET/latency_ms.min": 0,"route//GET/latency_ms.sum": 0,"route//GET/latency_ms.p50": 0,"route//GET/latency_ms.p90": 0,"route//GET/latency_ms.p95": 0,"route//GET/latency_ms.p99": 0,"route//GET/latency_ms.p9990": 0,"route//GET/latency_ms.p9999": 
0,"route//GET/errors.count": 0,"route//GET/in_throughput": 0,"route//GET/out_throughput": 1628356,"go_metrics/runtime/num_goroutines": 6,"system/start_time": 1570507704592,"system/cpu.pct": 100.000000,"system/cpu_cores": 4,"os": "linux","os_arch": "amd64","system/memory/available": 5576384512,"system/memory/used": 10214662144,"system/memory/used_percent": 63.169011,"process/memory/used": 72286456}
Example response from the `/prometheus` endpoint:
```
...
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.005"} 1
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.01"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.025"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.05"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.1"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.25"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="0.5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="1"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="2.5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="5"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="10"} 2
http_request_duration_seconds_bucket{key="all",method="",status="401",le="+Inf"} 2
http_request_duration_seconds_sum{key="all",method="",status="401"} 0.01088538
http_request_duration_seconds_count{key="all",method="",status="401"} 2
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.005"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.01"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.025"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.05"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.1"} 0
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.25"} 7
http_request_duration_seconds_bucket{key="all",method="",status="503",le="0.5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="1"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="2.5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="5"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="10"} 9
http_request_duration_seconds_bucket{key="all",method="",status="503",le="+Inf"} 9
http_request_duration_seconds_sum{key="all",method="",status="503"} 1.9743323400000001
http_request_duration_seconds_count{key="all",method="",status="503"} 9
# HELP http_request_size_bytes number of bytes read from the request
# TYPE http_request_size_bytes counter
http_request_size_bytes{key="/",method="GET",status="200"} 0
http_request_size_bytes{key="/",method="GET",status="304"} 0
http_request_size_bytes{key="/app-icon-144x144.png",method="GET",status="200"} 0
http_request_size_bytes{key="/app-icon-144x144.png",method="GET",status="304"} 0
http_request_size_bytes{key="/appConfig.js",method="GET",status="304"} 0
http_request_size_bytes{key="/favicon.ico",method="GET",status="200"} 0
http_request_size_bytes{key="/manifest.json",method="GET",status="304"} 0
http_request_size_bytes{key="/outdatedbrowser.min.css",method="GET",status="200"} 0
http_request_size_bytes{key="/outdatedbrowser.min.css",method="GET",status="304"} 0
http_request_size_bytes{key="/outdatedbrowser.min.js",method="GET",status="200"} 0
http_request_size_bytes{key="/outdatedbrowser.min.js",method="GET",status="304"} 0
http_request_size_bytes{key="/services/catalog/1.0/metrics",method="GET",status="200"} 0
http_request_size_bytes{key="/services/catalog/1.0/summary",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/props",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/read",method="POST",status="200"} 1379
http_request_size_bytes{key="/services/data/latest/self",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/show",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/static",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/static",method="GET",status="304"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="200"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="206"} 0
http_request_size_bytes{key="/services/data/latest/stream",method="GET",status="304"} 0
http_request_size_bytes{key="/services/gm-control-api/1.0/v1.0",method="GET",status="200"} 0
http_request_size_bytes{key="/services/jwt/latest/policies",method="GET",status="200"} 0
http_request_size_bytes{key="/services/jwt/latest/policies",method="GET",status="401"} 0
http_request_size_bytes{key="/services/jwt/latest/tokens",method="GET",status="307"} 0
http_request_size_bytes{key="/services/kibana/1.0/api",method="GET",status="200"} 0
http_response_size_bytes{key="all",method="",status="200"} 1.61519157e+08
http_response_size_bytes{key="all",method="",status="206"} 8.7419618e+07
http_response_size_bytes{key="all",method="",status="304"} 0
http_response_size_bytes{key="all",method="",status="307"} 67
http_response_size_bytes{key="all",method="",status="401"} 102
http_response_size_bytes{key="all",method="",status="503"} 513
# HELP non_tls_requests Number of requests not using TLS
...
```
{"metrics_key_function": <string>,"metrics_key_depth": <string>}
See Routing.
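As a sketch, rolling keys up to a deeper path is just a matter of overriding the default depth in the filter configuration shown above. The value `"3"` here is purely illustrative, and depending on your deployment `metrics_key_function` may also need to be set (see the table above):

```yaml
http_filters:
- name: gm.metrics
  config:
    metrics_port: 9080
    metrics_host: 0.0.0.0
    # Roll URL paths up to the first three path sections instead of one
    metrics_key_depth: "3"
```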
Typically, the greater the `metrics_key_depth`, the finer-grained the metrics you will end up with for analysis. However, there are some tradeoffs to consider.
As you can see in the `gm.metrics` filter documentation above, `metrics_key_depth` defaults to 1. The resulting metrics for an edge proxy would look something like this:
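(The following is a hypothetical sketch in the Prometheus exposition format shown earlier; the keys and counts are invented for illustration.)

```
http_request_duration_seconds_count{key="/services",method="GET",status="200"} 1874
http_request_duration_seconds_count{key="/apis",method="GET",status="200"} 412
```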
Note that the `key` field above only goes one path section deep. Does this provide enough granularity? It depends.
Let's say we have the following endpoints:
https://greymatter.io/apis/my-service/stores/
https://greymatter.io/apis/my-service/users/37
https://greymatter.io/apis/another-service/featured/2020/09
https://greymatter.io/apis/another-service/home.html
With a `metrics_key_depth` of 1, the average response time for the above routes gets rolled up to a single key:
/apis
If you choose a `metrics_key_depth` of 2, the same URLs get rolled up to two keys:
/apis/my-service
/apis/another-service
This would likely give you an idea of the average response time for each microservice. If URLs in your environment are structured as something like `https://[domain]/[service]/`, you get the same granularity with a `metrics_key_depth` of 1 (i.e. `key="/my-service"` and `key="/another-service"`).
If you choose a `metrics_key_depth` of 3, the URLs in the example get rolled up to:
/apis/my-service/stores/
/apis/my-service/users/
/apis/another-service/featured/
/apis/another-service/home.html
These look fine for the example URLs. But if URLs are structured like `https://[domain]/[service]/` and `my-service` has millions of users, you will end up with a key like `/my-service/users/[id]` for each and every user ID: millions of keys.
The motivation behind the default value of 1 is to minimize the amount of data stored. As stated in Prometheus' best practices:
> CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
Keep in mind that this example is for the edge proxy, through which requests for many different microservices flow. For this reason, the safe option is to choose a small `metrics_key_depth` to prevent cardinality explosions caused by a service that may be added in the future.
Service sidecars can also have the `gm.metrics` filter. Because a sidecar is specific to the service it sits next to, we can go a little deeper if we want to.
Let's take `my-service` from the first example:
https://greymatter.io/apis/my-service/stores/
https://greymatter.io/apis/my-service/users/
A `metrics_key_depth` of 1 will give us:
/stores
/users
It is typical to have a mesh route object that rewrites the path `/apis/my-service/` to `/` before forwarding the request to a sidecar, as sketched below. So even with a depth of 1, we still get time series data with a finer-grained path.
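Such a route object might look roughly like the following sketch; the field names used here (`path`, `prefix_rewrite`) are assumptions for illustration only, so refer to the Routing documentation for the exact schema:

```yaml
# Hypothetical mesh route: strip the service prefix before it reaches the sidecar
route_key: my-service-route
domain_key: edge
path: /apis/my-service/   # external path matched at the edge
prefix_rewrite: /         # path the sidecar (and its metrics keys) will see
```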
In short, the greater the `metrics_key_depth`, the faster the data storage will fill up. However, if highly rolled-up "average" metrics will not give users the information they need, there is no point in collecting them. In those scenarios, consider strategies other than reducing the `metrics_key_depth` value, such as shorter data retention periods or shipping to cheaper storage.
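For example, if Prometheus is the backing store, its standard retention flag caps how long data is kept; the value below is illustrative only:

```
# Keep at most 15 days of time series data
prometheus --storage.tsdb.retention.time=15d
```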