Metrics

The Code Blind controller exposes metrics via OpenCensus. OpenCensus is a single distribution of libraries that collects metrics and distributed traces from your services. We use it only for metrics, but it will allow us to support multiple exporters in the future.

We chose to start with Prometheus, as it is the most popular monitoring backend for Kubernetes, and it is also compatible with Cloud Monitoring. If you need another exporter, check the list of supported exporters. It should be pretty straightforward to register a new one. (GitHub PRs are more than welcome.)

We plan to support multiple exporters in the future via environment variables and helm flags.

Backend integrations

Prometheus

If you are running a Prometheus instance, you just need to ensure that metrics and Kubernetes service discovery are enabled (the agones.metrics.prometheusEnabled and agones.metrics.prometheusServiceDiscovery helm chart values). This automatically adds the annotations required by Prometheus to discover Code Blind metrics and start collecting them. (see example)
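
For reference, both values are typically enabled by default; a minimal values snippet setting them explicitly would look like this:

agones:
  metrics:
    prometheusEnabled: true
    prometheusServiceDiscovery: true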

If your Prometheus metrics collection agent requires that you scrape from the pods directly (such as with Google Cloud Managed Service for Prometheus), the metrics ports for the controller and allocator are both named http and exposed on 8080. In the case of the allocator, the port name and number can be overridden with the agones.allocator.serviceMetrics.http.portName and agones.allocator.serviceMetrics.http.port helm chart values.
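
For example, to override the allocator metrics port, a values snippet might look like the sketch below (the values keys are the ones named above; the name and number chosen here are hypothetical placeholders):

agones:
  allocator:
    serviceMetrics:
      http:
        portName: metrics   # hypothetical example name
        port: 9090          # hypothetical example port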

Prometheus Operator

If you have the Prometheus Operator installed in your cluster, just enable the ServiceMonitor installation in the values:

agones:
  metrics:
    serviceMonitor:
      enabled: true
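
Or, equivalently, pass the value on the helm command line (my-release-name is a placeholder release name, as elsewhere in this guide):

helm upgrade --install my-release-name agones/agones --namespace agones-system \
  --set agones.metrics.serviceMonitor.enabled=true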

Google Cloud Managed Service for Prometheus

Google Cloud Managed Service for Prometheus is a fully managed multi-cloud solution for Prometheus. If you wish to use Managed Prometheus with Code Blind, follow the Google Cloud Managed Service for Prometheus installation steps.

Google Cloud Monitoring (formerly Stackdriver)

We support the OpenCensus Stackdriver exporter. In order to use it, you should enable the Cloud Monitoring API in the Google Cloud Console. Follow the Google Cloud Monitoring installation steps to see your metrics in Cloud Monitoring.

Metrics available

Name | Description | Type
---- | ----------- | ----
agones_gameservers_count | The number of gameservers per fleet and status | gauge
agones_gameserver_allocations_duration_seconds | The distribution of gameserver allocation request latencies | histogram
agones_gameservers_total | The total of gameservers per fleet and status | counter
agones_gameserver_player_connected_total | The total number of players connected to gameservers (only available when player tracking is enabled) | gauge
agones_gameserver_player_capacity_total | The available capacity for players on gameservers (only available when player tracking is enabled) | gauge
agones_fleets_replicas_count | The number of replicas per fleet (total, desired, ready, reserved, allocated) | gauge
agones_fleet_autoscalers_able_to_scale | The fleet autoscaler can access the fleet to scale | gauge
agones_fleet_autoscalers_buffer_limits | The limits of buffer-based fleet autoscalers (min, max) | gauge
agones_fleet_autoscalers_buffer_size | The buffer size of fleet autoscalers (count or percentage) | gauge
agones_fleet_autoscalers_current_replicas_count | The current replicas count as seen by autoscalers | gauge
agones_fleet_autoscalers_desired_replicas_count | The desired replicas count as seen by autoscalers | gauge
agones_fleet_autoscalers_limited | The fleet autoscaler is outside the limits set by MinReplicas and MaxReplicas | gauge
agones_gameservers_node_count | The distribution of gameservers per node | histogram
agones_nodes_count | The count of nodes that are empty and nodes with gameservers | gauge
agones_gameservers_state_duration | The distribution of gameserver state duration in seconds. Note: this metric could have some missing samples by design; do not use the _total counter as the real value for state changes | histogram
agones_k8s_client_http_request_total | The total of HTTP requests to the Kubernetes API by status code | counter
agones_k8s_client_http_request_duration_seconds | The distribution of HTTP request latencies to the Kubernetes API by status code | histogram
agones_k8s_client_cache_list_total | The total number of list operations for client-go caches | counter
agones_k8s_client_cache_list_duration_seconds | Duration of a Kubernetes list API call in seconds | histogram
agones_k8s_client_cache_list_items | Count of items in a list from the Kubernetes API | histogram
agones_k8s_client_cache_watches_total | The total number of watch operations for client-go caches | counter
agones_k8s_client_cache_last_resource_version | Last resource version from the Kubernetes API | gauge
agones_k8s_client_workqueue_depth | Current depth of the work queue | gauge
agones_k8s_client_workqueue_latency_seconds | How long an item stays in the work queue | histogram
agones_k8s_client_workqueue_items_total | Total number of items added to the work queue | counter
agones_k8s_client_workqueue_work_duration_seconds | How long processing an item from the work queue takes | histogram
agones_k8s_client_workqueue_retries_total | Total number of items retried to the work queue | counter
agones_k8s_client_workqueue_longest_running_processor | How long the longest running workqueue processor has been running, in microseconds | gauge
agones_k8s_client_workqueue_unfinished_work_seconds | How long unfinished work has been sitting in the workqueue, in seconds | gauge
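
To spot-check these metrics without a Prometheus server, you can scrape the controller's metrics endpoint directly. This sketch assumes the default installation, where the controller deployment is named agones-controller and serves Prometheus metrics on the http port (8080) described above; run the port-forward in a separate terminal:

kubectl port-forward deployments/agones-controller 8080 -n agones-system
curl -s http://localhost:8080/metrics | grep agones_gameservers_count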

Dropping Metric Labels

When a Fleet or FleetAutoscaler is deleted from the system, Code Blind automatically clears exported metrics that use its name as a label, so that the exported metrics do not grow continuously in size over the lifecycle of the Code Blind installation.

Dashboard

Grafana Dashboards

We provide a set of useful Grafana dashboards to monitor Code Blind workloads; they are located under the grafana folder.

Dashboard screenshots:

(Screenshot: Grafana autoscalers dashboard)

(Screenshot: Grafana controller dashboard)

Installation

When operating a live multiplayer game, you will need to observe performance, resource usage, and availability to learn more about your system. This guide explains how to set up Prometheus and Grafana in your own Kubernetes cluster to monitor your Code Blind workload.

Before attempting this guide, make sure you have kubectl and helm installed and configured to reach your Kubernetes cluster.

Prometheus installation

Prometheus is an open source monitoring solution; we will use it to store Code Blind controller metrics and query the data back.

Let’s install Prometheus using the Prometheus Community Kubernetes Helm Charts repository.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install --wait prom prometheus-community/prometheus --namespace metrics --create-namespace \
    --set server.global.scrape_interval=30s \
    --set server.persistentVolume.enabled=true \
    --set server.persistentVolume.size=64Gi \
    -f ./build/prometheus.yaml

For resiliency, it is recommended to run Prometheus on a dedicated node, separate from the nodes where game servers are scheduled. If you use the above command with our prometheus.yaml to set up Prometheus, it will schedule Prometheus pods on nodes tainted with agones.dev/agones-metrics=true:NoExecute and labeled with agones.dev/agones-metrics=true, if available.

As an example, to set up a dedicated node pool for Prometheus on GKE, run the following command before installing Prometheus. Alternatively you can taint and label nodes manually.

gcloud container node-pools create agones-metrics --cluster=... --zone=... \
  --node-taints agones.dev/agones-metrics=true:NoExecute \
  --node-labels agones.dev/agones-metrics=true \
  --num-nodes=1 \
  --machine-type=e2-standard-4

By default, we disable the push gateway (we don't need it for Code Blind) and other exporters.

The helm chart supports nodeSelector, affinity, and tolerations; you can use them to schedule the Prometheus deployment on isolated nodes and keep your game server workload homogeneous, as sketched below.
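
For illustration, values along these lines would pin the Prometheus server to the dedicated node pool created above. This is a minimal sketch assuming the standard prometheus-community chart keys; our prometheus.yaml configures the equivalent:

server:
  nodeSelector:
    agones.dev/agones-metrics: "true"
  tolerations:
    - key: agones.dev/agones-metrics
      operator: Equal
      value: "true"
      effect: NoExecute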

This will install a Prometheus server in your current cluster, with a Persistent Volume Claim (deactivated for Minikube and Kind) for storing and querying time series, and it will automatically start collecting metrics from the Code Blind controller.

Finally, to access the Prometheus metrics, rules, and alerts explorer, use:

kubectl port-forward deployments/prom-prometheus-server 9090 -n metrics

Now you can access the Prometheus dashboard at http://localhost:9090.

On the landing page, you can start exploring metrics by creating queries. You can also verify which targets Prometheus currently monitors (Status > Targets in the header); you should see the Code Blind controller pod in the kubernetes-pods section.
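
For example, with the port-forward above still running, you can query a metric from the command line via the Prometheus HTTP API (agones_gameservers_count is one of the metrics listed earlier):

curl -s 'http://localhost:9090/api/v1/query?query=agones_gameservers_count'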

Now let’s install some Grafana dashboards.

Grafana installation

Grafana is an open source time series analytics platform that supports Prometheus as a data source. We can also easily import pre-built dashboards.

First, we will install the Code Blind dashboards as config maps in our cluster.

kubectl apply -f ./build/grafana/

Now we can install Grafana from the Grafana Community Kubernetes Helm Charts repository. (Replace <your-admin-password> with the admin password of your choice.)

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm upgrade --install --wait grafana grafana/grafana --namespace metrics \
  --set adminPassword=<your-admin-password> -f ./build/grafana.yaml

This will install Grafana with our pre-populated dashboards and the Prometheus data source installed previously.

Finally, to access the dashboards, run:

kubectl port-forward deployments/grafana 3000 -n metrics

Open a web browser to http://localhost:3000; you should see the Code Blind dashboards after logging in as admin.

Google Cloud Managed Service for Prometheus installation

To collect Code Blind metrics using Managed Prometheus:

  • Follow the instructions to enable managed collection for a GKE cluster or non-GKE cluster.

  • Configure Managed Prometheus to scrape Code Blind by creating a PodMonitoring resource:

kubectl apply -n agones-system -f https://raw.githubusercontent.com/googleforgames/agones/release-1.38.0/build/prometheus-google-managed.yaml
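
The applied manifest creates PodMonitoring resources. Conceptually, such a resource looks roughly like the sketch below; the metadata and selector label here are illustrative assumptions, not the exact contents of the file above:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: agones-controller    # hypothetical name
  namespace: agones-system
spec:
  selector:
    matchLabels:
      agones.dev/role: controller   # assumed label; check the manifest above
  endpoints:
    - port: http      # the metrics port name described earlier
      interval: 30s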

To install Grafana using a Managed Prometheus backend:

Google Cloud Monitoring installation

In order to use Google Cloud Monitoring you must enable the Monitoring API in the Google Cloud Console. The Cloud Monitoring exporter uses a strategy called Application Default Credentials (ADC) to find your application’s credentials. Details can be found in Setting Up Authentication for Server to Server Production Applications.

You need to grant all the necessary permissions to the users (see the Access Control Guide). The predefined role Monitoring Metric Writer contains those permissions. Use the following command to assign the role to your default service account:

gcloud projects add-iam-policy-binding [PROJECT_ID] \
  --member serviceAccount:[PROJECT_NUMBER]-compute@developer.gserviceaccount.com \
  --role roles/monitoring.metricWriter

Before proceeding, ensure you have created a metrics node pool as mentioned in the Google Cloud installation guide.

The default metrics exporter installed with Code Blind is Prometheus. If you are using the Helm installation, you can install or upgrade Code Blind to use Cloud Monitoring instead, using the following chart parameters:

helm upgrade --install --wait my-release-name agones/agones --namespace=agones-system \
  --set agones.metrics.stackdriverEnabled=true \
  --set agones.metrics.prometheusEnabled=false \
  --set agones.metrics.prometheusServiceDiscovery=false

With this configuration, only the Cloud Monitoring exporter is used, instead of the Prometheus exporter.

Using Cloud Monitoring with Workload Identity

If you would like to enable Cloud Monitoring in conjunction with Workload Identity, there are a few extra steps you need to follow:

  1. When setting up the Google service account following the instructions for Authenticating to Google Cloud, create two IAM policy bindings: one for serviceAccount:PROJECT_ID.svc.id.goog[agones-system/agones-controller] and one for serviceAccount:PROJECT_ID.svc.id.goog[agones-system/agones-allocator] (see the sketch after this list).

  2. Pass parameters to helm when installing Code Blind to add annotations to the agones-controller and agones-allocator Kubernetes service accounts:

helm install my-release agones/agones --namespace agones-system --create-namespace \
  --set agones.metrics.stackdriverEnabled=true \
  --set agones.metrics.prometheusEnabled=false \
  --set agones.metrics.prometheusServiceDiscovery=false \
  --set agones.serviceaccount.allocator.annotations."iam\.gke\.io/gcp-service-account"="GSA_NAME@PROJECT_ID\.iam\.gserviceaccount\.com" \
  --set agones.serviceaccount.allocator.labels."iam\.gke\.io/gcp-service-account"="GSA_NAME@PROJECT_ID\.iam\.gserviceaccount\.com" \
  --set agones.serviceaccount.controller.annotations."iam\.gke\.io/gcp-service-account"="GSA_NAME@PROJECT_ID\.iam\.gserviceaccount\.com"
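
For step 1, the bindings follow the standard Workload Identity pattern; a sketch for the controller binding (GSA_NAME and PROJECT_ID are placeholders, as above) might look like:

gcloud iam service-accounts add-iam-policy-binding GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[agones-system/agones-controller]"

Repeat with the agones-allocator member for the second binding.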

To verify that metrics are being sent to Cloud Monitoring, create a Fleet or a GameServer and look for the metrics to show up in the Cloud Monitoring dashboard. Navigate to the Metrics explorer and search for metrics with the agones/ prefix. Select a metric and look for data to be plotted in the graph to the right.
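
To generate some metrics for this check, you can, for example, create the example Fleet shipped with the project (assuming the examples path at the same release tag referenced earlier):

kubectl create -f https://raw.githubusercontent.com/googleforgames/agones/release-1.38.0/examples/fleet.yaml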

An example of a custom dashboard:

(Screenshot: Cloud Monitoring custom dashboard)

Currently, Cloud Monitoring dashboards can only be configured manually, so it is up to you to set the Alignment Period (the minimum is 1 minute), Group By, Filter parameters, and other graph settings.

Troubleshooting

If you can't see Code Blind metrics, have a look at the controller logs for connection errors. Also ensure that your cluster has the necessary credentials to interact with Cloud Monitoring. If automatic discovery is not working, you can configure stackdriverProjectID manually.
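
A quick way to check the logs (agones-controller is the controller deployment name, assuming the default installation):

kubectl logs -n agones-system deployments/agones-controller | grep -i stackdriver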

An example of a permissions problem from the controller logs:

Failed to export to Stackdriver: rpc error: code = PermissionDenied desc = Permission monitoring.metricDescriptors.create denied (or the resource may not exist).

If you receive this error, ensure your service account has the role or corresponding permissions mentioned above.

