Observability metrics

Rhize uses Prometheus to monitor metrics from many of its microservices. For the Kubernetes cluster, Rhize runs the Prometheus operator and monitors the accumulated metrics in Grafana dashboards. Monitoring is granular, at the cluster, pod, and container levels.

Metrics endpoints

The service metrics have endpoints at the following ports:

Service	Available	Enabled	Port
Audit	Y	Y	8084
BAAS	Y	Y	8080
Router	Y		9090
Tempo	Y	Y	3100

Router has an available endpoint that is disabled by default.

Metrics configuration

For services where metrics are disabled by default, some configuration steps may be required. In case you are experimenting locally, this document includes instructions for both the cluster and Docker.

After you enable metrics, add them into the Prometheus configuration file by pointing to that service endpoint. For example:

- job_name: 'rhize-application-monitoring'
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
    - targets: ['audit-demo.demo.svc.cluster.local:8084', 'baas-alpha.demo.svc.cluster.local:8080', 'grafana-demo.demo.svc.cluster.local:3000', 'tempo.demo.svc.cluster.local:3100', 'router-demo.demo.svc.cluster.local:9090']

Router

Cluster

To enable metrics, add or change several options in the Router Helm chart, as shown in the following example. For details, refer to the official Apollo instructions.

router:
  configuration:
    # -- Other configuration prior
    telemetry:
      metrics:
        prometheus:
          enabled: true
          listen: 0.0.0.0:9090
          path: "/metrics"

# -- Open container ports
containerPorts:
  metrics: 9090

# -- Enable service monitor
serviceMonitor:
  enabled: true

You can connect to this endpoint through the router pod. For example:

router-demo.demo.svc.cluster.local:9090

Docker

To enable Router metrics, modify its configuration file. For details, refer to the official Apollo instructions.

The following is an example configuration:

# -- Other configuration prior
telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics

This opens the metrics endpoint on port 9090. To view it from outside the container, you must expose the port in docker-compose. Once the port is exposed, view the metrics at localhost:9090/metrics.
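As a quick check that the endpoint responds, the following minimal sketch downloads the metrics text and lists the metric names it declares. It assumes the Router port is exposed on localhost as described above. The function names and the `METRICS_URL` constant are illustrative and not part of Rhize or Apollo Router.

```python
import urllib.request

# Assumption: port 9090 is exposed locally, as in the Docker setup above.
METRICS_URL = "http://localhost:9090/metrics"


def metric_names(exposition_text):
    """Return the sorted metric names declared by '# TYPE' lines."""
    names = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        # A TYPE line has the shape: "# TYPE <metric_name> <type>"
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return sorted(names)


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Download the raw Prometheus exposition text from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the Router running, `metric_names(fetch_metrics())` returns the declared metric names. An empty list or a connection error suggests that metrics are not enabled or that the port is not exposed.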

Available Rhize microservice metrics

Several categories of metrics are common across Rhize microservices:

go
process
http.

Service	Instrumented Prometheus Go Application	process metrics	HTTP metrics*	Additional
Audit	Y	Y	Y
BAAS	Y	Y		Y
Router				Y
Tempo	Y			Y

* HTTP metrics appear under the promhttp prefix.

Additional metrics on BAAS come from Dgraph. These fall into two categories: Badger and Dgraph.

Example:

# HELP badger_disk_reads_total Number of cumulative reads by Badger
# TYPE badger_disk_reads_total untyped
badger_disk_reads_total 0

# HELP badger_disk_writes_total Number of cumulative writes by Badger
# TYPE badger_disk_writes_total untyped
badger_disk_writes_total 0

# HELP badger_gets_total Total number of gets
# TYPE badger_gets_total untyped
badger_gets_total 0

# HELP dgraph_alpha_health_status Status of the alphas
# TYPE dgraph_alpha_health_status gauge
dgraph_alpha_health_status 1

# HELP dgraph_disk_free_bytes Total number of bytes free on disk
# TYPE dgraph_disk_free_bytes gauge
dgraph_disk_free_bytes{dir="postings_fs"} 1.0153562112e+10

# HELP dgraph_disk_total_bytes Total number of bytes on disk
# TYPE dgraph_disk_total_bytes gauge
dgraph_disk_total_bytes{dir="postings_fs"} 1.0447245312e+10
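To illustrate how such samples can be used, the following sketch parses exposition text into a dictionary and derives the fraction of disk used from the two Dgraph gauges shown above. It assumes sample lines carry no trailing timestamp, as in the example. The helper names are hypothetical.

```python
def parse_samples(exposition_text):
    """Parse sample lines into {series: value}, skipping comments and blanks.

    Assumes each sample line ends with its value (no optional timestamp).
    The series key keeps the label set as written.
    """
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces stay intact.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def disk_used_fraction(samples, directory="postings_fs"):
    """Return (total - free) / total for the given Dgraph directory label."""
    free = samples[f'dgraph_disk_free_bytes{{dir="{directory}"}}']
    total = samples[f'dgraph_disk_total_bytes{{dir="{directory}"}}']
    return (total - free) / total
```

For the values in the example above, the result is roughly 0.028, which means about 2.8% of the disk is in use.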

Router

All metrics provided by Router are unique to Apollo Router.

Example:

# HELP apollo_router_cache_hit_count apollo_router_cache_hit_count
# TYPE apollo_router_cache_hit_count counter
apollo_router_cache_hit_count{kind="query planner",service_name="router-demo",storage="memory"} 121802

# HELP apollo_router_cache_hit_time apollo_router_cache_hit_time
# TYPE apollo_router_cache_hit_time histogram
apollo_router_cache_hit_time_bucket{kind="query planner",service_name="router-demo",storage="memory",le="0.001"} 121802
apollo_router_cache_hit_time_bucket{kind="query planner",service_name="router-demo",storage="memory",le="0.005"} 121802
apollo_router_cache_hit_time_bucket{kind="query planner",service_name="router-demo",storage="memory",le="0.015"} 121802
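Prometheus histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound. The following sketch converts the cumulative counts from the example above into per-interval counts. The function name is illustrative.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (le, count) pairs into counts per bucket interval."""
    counts = []
    previous = 0
    for upper_bound, cumulative_count in sorted(cumulative):
        # Observations in this interval are those not already counted below it.
        counts.append((upper_bound, cumulative_count - previous))
        previous = cumulative_count
    return counts


# Bucket values copied from the example above.
buckets = [(0.001, 121802), (0.005, 121802), (0.015, 121802)]
print(per_bucket_counts(buckets))
# prints [(0.001, 121802), (0.005, 0), (0.015, 0)]
```

In this example, every cache hit falls in the lowest bucket (`le="0.001"`), and the higher buckets add no further observations.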

Tempo

Tempo has a few categories of metrics:

jaeger
prometheus
tempo
tempodb.

The Tempo documentation details what these metrics measure.

Example:

# HELP jaeger_tracer_baggage_restrictions_updates_total Number of times baggage restrictions were successfully updated
# TYPE jaeger_tracer_baggage_restrictions_updates_total counter
jaeger_tracer_baggage_restrictions_updates_total{result="err"} 0
jaeger_tracer_baggage_restrictions_updates_total{result="ok"} 0

# HELP jaeger_tracer_baggage_truncations_total Number of times baggage was truncated as per baggage restrictions
# TYPE jaeger_tracer_baggage_truncations_total counter
jaeger_tracer_baggage_truncations_total 0

# HELP prometheus_remote_storage_exemplars_in_total Exemplars in to remote storage, compare to exemplars out for queue managers.
# TYPE prometheus_remote_storage_exemplars_in_total counter
prometheus_remote_storage_exemplars_in_total 0

# HELP prometheus_remote_storage_histograms_in_total HistogramSamples in to remote storage, compare to histograms out for queue managers.
# TYPE prometheus_remote_storage_histograms_in_total counter
prometheus_remote_storage_histograms_in_total 0

# HELP prometheus_remote_storage_samples_in_total Samples in to remote storage, compare to samples out for queue managers.
# TYPE prometheus_remote_storage_samples_in_total counter
prometheus_remote_storage_samples_in_total 0

# HELP tempo_distributor_ingester_clients The current number of ingester clients.
# TYPE tempo_distributor_ingester_clients gauge
tempo_distributor_ingester_clients 0

# HELP tempo_distributor_metrics_generator_clients The current number of metrics-generator clients.
# TYPE tempo_distributor_metrics_generator_clients gauge
tempo_distributor_metrics_generator_clients 0

# HELP tempo_distributor_push_duration_seconds Records the amount of time to push a batch to the ingester.
# TYPE tempo_distributor_push_duration_seconds histogram
tempo_distributor_push_duration_seconds_bucket{le="0.005"} 0
tempo_distributor_push_duration_seconds_bucket{le="0.01"} 0

# HELP tempodb_backend_hedged_roundtrips_total Total number of hedged backend requests. Registered as a gauge for code sanity. This is a counter.
# TYPE tempodb_backend_hedged_roundtrips_total gauge
tempodb_backend_hedged_roundtrips_total 0

# HELP tempodb_blocklist_poll_duration_seconds Records the amount of time to poll and update the blocklist.
# TYPE tempodb_blocklist_poll_duration_seconds histogram
tempodb_blocklist_poll_duration_seconds_bucket{le="0"} 0
tempodb_blocklist_poll_duration_seconds_bucket{le="60"} 2012
tempodb_blocklist_poll_duration_seconds_bucket{le="120"} 2012

Dashboards

A number of Grafana dashboards are pre-configured for use with Prometheus metrics. All dashboards in Grafana use Prometheus as a data source.

You can download them from Rhize Dashboard templates.
