Observability metrics
Rhize uses Prometheus to monitor metrics from many of its microservices. For the Kubernetes cluster, Rhize runs the Prometheus operator and monitors the accumulated metrics in Grafana dashboards. Monitoring occurs granularly, on the levels of cluster, pod, and container.
Metrics endpoints
The service metrics have endpoints at the following ports:
| Service | Available | Enabled | Port |
|---|---|---|---|
| Audit | Y | Y | 8084 |
| BAAS | Y | Y | 8080 |
| Router | Y | 9090 | |
| Tempo | Y | Y | 3100 |
Router has an available endpoint that is disabled by default.
Metrics configuration
For services where metrics are disabled by default, some configuration steps may be required. In case you are experimenting locally, this document includes for both the cluster and for Docker.
After you enable metrics, add them into the Prometheus configuration file by pointing to that service endpoint. For example:
- job_name: 'rhize-application-monitoring'
honor_timestamps: true
scrape_interval: 15s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
static_configs:
- targets: ['audit-demo.demo.svc.cluster.local:8084', 'baas-alpha.demo.svc.cluster.local:8080', 'grafana-demo.demo.svc.cluster.local:3000', 'tempo.demo.svc.cluster.local:3100', 'router-demo.demo.svc.cluster.local:9090']Router
Cluster
To enable metrics, the Router Helm chart needs to have several options added or changed, as follows: For details, refer to the Official Apollo instructions.
router:
configuration:
# -- Other configuration prior
telemetry:
metrics:
prometheus:
enabled: true
listen: 0.0.0.0:9090
path: "/metrics"
# -- Open container ports
containerPorts:
metrics: 9090
# -- Enable service monitor
serviceMonitor:
enabled: trueYou can connect to this endpoint through the router pod. For example:
router-demo.demo.svc.cluster.local:9090
Docker
To enable Router metrics, modify its configuration file. For details, refer to the Official Apollo Instructions.
The following is an example configuration:
# -- Other configuration prior
telemetry:
exporters:
metrics:
prometheus:
enabled: true
listen: 0.0.0.0:9090
path: /metricsThis opens the metrics endpoint on port 9090.
To view it externally, you must expose the port in docker-compose.
Once the port is exposed, view the metrics at localhost:9090/metrics
Available Rhize microservice metrics
Several common metrics appear between Rhize microservices:
goprocesshttp.
| Service | Instrumented Prometheus Go Application | process metrics | HTTP metrics* | Additional |
|---|---|---|---|---|
| Audit | Y | Y | Y | |
| BAAS | Y | Y | Y | |
| Router | Y | |||
| Tempo | Y | Y |
promhttp.Additional metrics on BAAS are from dgraph. These include two categories: Badger and Dgraph.
Example:
# HELP badger_disk_reads_total Number of cumulative reads by Badger
# TYPE badger_disk_reads_total untyped
badger_disk_reads_total 0
# HELP badger_disk_writes_total Number of cumulative writes by Badger
# TYPE badger_disk_writes_total untyped
badger_disk_writes_total 0
# HELP badger_gets_total Total number of gets
# TYPE badger_gets_total untyped
badger_gets_total 0# HELP dgraph_alpha_health_status Status of the alphas
# TYPE dgraph_alpha_health_status gauge
dgraph_alpha_health_status 1
# HELP dgraph_disk_free_bytes Total number of bytes free on disk
# TYPE dgraph_disk_free_bytes gauge
dgraph_disk_free_bytes{dir="postings_fs"} 1.0153562112e+10
# HELP dgraph_disk_total_bytes Total number of bytes on disk
# TYPE dgraph_disk_total_bytes gauge
dgraph_disk_total_bytes{dir="postings_fs"} 1.0447245312e+10Router
All metrics provided by Router are unique to Apollo Router.
Example:
# HELP apollo_router_cache_hit_count apollo_router_cache_hit_count
# TYPE apollo_router_cache_hit_count counter
apollo_router_cache_hit_count{kind="query planner",service_name="router-demo",storage="memory"} 121802
# HELP apollo_router_cache_hit_time apollo_router_cache_hit_time
# TYPE apollo_router_cache_hit_time histogram
apollo_router_cache_hit_time_bucket{kind="query planner",service_name="router-demo",storage="memory",le="0.001"} 121802
apollo_router_cache_hit_time_bucket{kind="query planner",service_name="router-demo",storage="memory",le="0.005"} 121802
apollo_router_cache_hit_time_bucket{kind="query planner",service_name="router-demo",storage="memory",le="0.015"} 121802Tempo
Tempo has a few categories of metrics:
jaegerprometheustempotempodb.
The Tempo documentation details what these metrics measure.
Example:
# HELP jaeger_tracer_baggage_restrictions_updates_total Number of times baggage restrictions were successfully updated
# TYPE jaeger_tracer_baggage_restrictions_updates_total counter
jaeger_tracer_baggage_restrictions_updates_total{result="err"} 0
jaeger_tracer_baggage_restrictions_updates_total{result="ok"} 0
# HELP jaeger_tracer_baggage_truncations_total Number of times baggage was truncated as per baggage restrictions
# TYPE jaeger_tracer_baggage_truncations_total counter
jaeger_tracer_baggage_truncations_total 0# HELP prometheus_remote_storage_exemplars_in_total Exemplars in to remote storage, compare to exemplars out for queue managers.
# TYPE prometheus_remote_storage_exemplars_in_total counter
prometheus_remote_storage_exemplars_in_total 0
# HELP prometheus_remote_storage_histograms_in_total HistogramSamples in to remote storage, compare to histograms out for queue managers.
# TYPE prometheus_remote_storage_histograms_in_total counter
prometheus_remote_storage_histograms_in_total 0
# HELP prometheus_remote_storage_samples_in_total Samples in to remote storage, compare to samples out for queue managers.
# TYPE prometheus_remote_storage_samples_in_total counter
prometheus_remote_storage_samples_in_total 0# HELP tempo_distributor_ingester_clients The current number of ingester clients.
# TYPE tempo_distributor_ingester_clients gauge
tempo_distributor_ingester_clients 0
# HELP tempo_distributor_metrics_generator_clients The current number of metrics-generator clients.
# TYPE tempo_distributor_metrics_generator_clients gauge
tempo_distributor_metrics_generator_clients 0
# HELP tempo_distributor_push_duration_seconds Records the amount of time to push a batch to the ingester.
# TYPE tempo_distributor_push_duration_seconds histogram
tempo_distributor_push_duration_seconds_bucket{le="0.005"} 0
tempo_distributor_push_duration_seconds_bucket{le="0.01"} 0# HELP tempodb_backend_hedged_roundtrips_total Total number of hedged backend requests. Registered as a gauge for code sanity. This is a counter.
# TYPE tempodb_backend_hedged_roundtrips_total gauge
tempodb_backend_hedged_roundtrips_total 0
# HELP tempodb_blocklist_poll_duration_seconds Records the amount of time to poll and update the blocklist.
# TYPE tempodb_blocklist_poll_duration_seconds histogram
tempodb_blocklist_poll_duration_seconds_bucket{le="0"} 0
tempodb_blocklist_poll_duration_seconds_bucket{le="60"} 2012
tempodb_blocklist_poll_duration_seconds_bucket{le="120"} 2012Dashboards
A number of Grafana dashboards are pre-configured for use with Prometheus metrics. All dashboards in Grafana use Prometheus as a data source.
You can download them from Rhize Dashboard templates.