1 - Scheduling and Autoscaling

Scheduling and autoscaling go hand in hand: where GameServers are provisioned in the cluster impacts how to autoscale Fleets up and down (or whether you would even want to).

Cluster Autoscaler

Kubernetes has a cluster node autoscaler that works with a wide variety of cloud providers.

The default scheduling strategy (Packed) is designed to work with the Kubernetes autoscaler out of the box.

The autoscaler will automatically add Nodes to the cluster when GameServers don’t have room to be scheduled, and scale down when there are empty Nodes with no GameServers running on them.

This means that scaling Fleets up and down can be used to control the size of the cluster, as the cluster autoscaler will adjust the size of the cluster to match the resource needs of one or more Fleets running on it.
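
For example, assuming a Fleet named simple-game-server (as in the examples later in this page), scaling the Fleet will indirectly drive the cluster size: the autoscaler adds Nodes when the new replicas don’t fit, and removes Nodes once they become empty.

kubectl scale fleet simple-game-server --replicas=50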

To enable and configure autoscaling on your cloud provider, check their connector implementation, or their cloud-specific documentation:

Google Kubernetes Engine

Amazon Elastic Kubernetes Service

Azure Kubernetes Service

Fleet Autoscaling

Fleet autoscaling is the only type of autoscaling that exists in Code Blind. It is currently available as a buffer autoscaling strategy or as a webhook-driven strategy, which lets you provide your own autoscaling logic.

Have a look at the Create a Fleet Autoscaler quickstart, the Create a Webhook Fleet Autoscaler quickstart, and the Fleet Autoscaler Specification for details.
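
As a minimal sketch (assuming a Fleet named simple-game-server; see the quickstarts above for full details), a buffer-based FleetAutoscaler keeps a number of Ready GameServers available ahead of demand:

apiVersion: "autoscaling.agones.dev/v1"
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  policy:
    type: Buffer
    buffer:
      bufferSize: 5     # keep 5 Ready GameServers ahead of demand
      minReplicas: 10
      maxReplicas: 50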

Autoscaling Concepts

To facilitate autoscaling, we need to combine several concepts and functionality, as described below.

Allocation Scheduling

Allocation scheduling refers to the order in which GameServers, and specifically their backing Pods, are chosen from across the Kubernetes cluster within a given Fleet when an allocation occurs.

Pod Scheduling

Each GameServer is backed by a Kubernetes Pod. Pod scheduling refers to the strategy that determines which Node in the Kubernetes cluster the Pod is assigned to when it is created.

Fleet Scale Down Strategy

Fleet Scale Down strategy refers to the order in which the GameServers that belong to a Fleet are deleted when the Fleet is scaled down.

Fleet Scheduling

There are two scheduling strategies for Fleets - each designed for different types of Kubernetes Environments.

Packed

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 100
  scheduling: Packed
  template:
    spec:
      ports:
      - containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27

This is the default Fleet scheduling strategy. It is designed for dynamic Kubernetes environments, wherein you wish to scale up and down as load increases or decreases, such as in a Cloud environment where you are paying for the infrastructure you use.

It attempts to pack as much as possible into the smallest set of nodes, to make scaling infrastructure down as easy as possible.

This affects the Cluster autoscaler, Allocation Scheduling, Pod Scheduling and Fleet Scale Down Scheduling.

Cluster Autoscaler

When using the “Packed” strategy, Code Blind will ensure that the Cluster Autoscaler doesn’t attempt to evict and move GameServer Pods onto new Nodes during gameplay.

If a GameServer can tolerate being evicted (generally in combination with setting an appropriate graceful termination period on the GameServer Pod), and you want the Cluster Autoscaler to compact your cluster by evicting game servers when doing so would allow it to reduce the number of Nodes, Controlling Disruption describes how to choose the .eviction setting appropriate for your GameServer or Fleet.
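
As a sketch of where that setting lives, opting a Fleet’s game servers in to Cluster Autoscaler eviction means setting eviction in the GameServer template (see Controlling Disruption for the full decision process; the Fleet below is the same illustrative simple-game-server example used above):

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 100
  scheduling: Packed
  template:
    spec:
      eviction:
        safe: Always   # allow the Cluster Autoscaler to evict and reschedule these game servers
      ports:
      - containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27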

Allocation Scheduling Strategy

Under the “Packed” strategy, allocation will prioritise GameServers whose backing Pods are running on Nodes that already have allocated GameServers running on them.

Pod Scheduling Strategy

Under the “Packed” strategy, Pods will be scheduled using a PodAffinity with a preferredDuringSchedulingIgnoredDuringExecution term and a hostname topology key. This attempts to group GameServer Pods together on as few Nodes in the cluster as possible.
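
For illustration, the resulting scheduling behaviour is roughly equivalent to adding an affinity like the following to each GameServer Pod (a sketch only; the exact selector and weight are managed by Code Blind and may differ):

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname   # prefer Nodes already running game server Pods
        labelSelector:
          matchLabels:
            agones.dev/role: gameserver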

Fleet Scale Down Strategy

With the “Packed” strategy, Fleets will remove Ready GameServers from the Nodes with the least number of Ready and Allocated GameServers on them, attempting to empty Nodes so that they can be safely removed.

Distributed

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 100
  scheduling: Distributed
  template:
    spec:
      ports:
      - containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27

This Fleet scheduling strategy is designed for static Kubernetes environments, such as when you are running Kubernetes on bare metal, and the cluster size rarely changes, if at all.

This attempts to distribute the load across the entire cluster as much as possible, to take advantage of the static size of the cluster.

This affects Allocation Scheduling, Pod Scheduling and Fleet Scale Down Scheduling.

Cluster Autoscaler

Since this strategy is not aimed at clusters that autoscale, it does nothing for the Cluster Autoscaler.

Allocation Scheduling Strategy

Under the “Distributed” strategy, allocation will prioritise allocating GameServers to nodes that have the least number of allocated GameServers on them.

Pod Scheduling Strategy

Under the “Distributed” strategy, Pod scheduling is provided by the default Kubernetes scheduler, which will attempt to distribute the GameServer Pods across as many nodes as possible.

Fleet Scale Down Strategy

With the “Distributed” strategy, Fleets will remove Ready GameServers from Nodes at random, to ensure a distributed load is maintained.

2 - High Availability Code Blind

Learn how to configure your Code Blind services for high availability and resiliency to disruptions.

High Availability for Code Blind Controller

The agones-controller responsibility is split up into agones-controller, which enacts the Code Blind control loop, and agones-extensions, which acts as a service endpoint for webhooks and the allocation extension API. Splitting these responsibilities allows the agones-extensions pod to be horizontally scaled, making the Code Blind control plane highly available and more resilient to disruption.

Multiple agones-controller pods can be run, with a primary controller selected via leader election. Having multiple agones-controller pods minimizes downtime of the service from pod disruptions such as deployment updates, autoscaler evictions, and crashes.

Extension Pod Configurations

The agones-extensions binary has a similar helm configuration to agones-controller, see here. If you previously overrode agones.controller.* settings, you may need to override the same agones.extensions.* setting.

To change controller.numWorkers from 100 to 200 using helm --set, add the following to the helm command:

 ...
 --set agones.controller.numWorkers=200
 ...

An important configuration to note is the PodDisruptionBudget fields, agones.extensions.pdb.minAvailable and agones.extensions.pdb.maxUnavailable. Currently, the agones.extensions.pdb.minAvailable field is set to 1.

Deployment Considerations

Leader election will automatically be enabled when agones.controller.replicas is > 1. agones.controller.replicas defaults to 2.

The default configuration now deploys 2 agones-controller pods and 2 agones-extensions pods, replacing the previous single agones-controller pod setup. For example:

NAME                                 READY   STATUS    RESTARTS   AGE
agones-allocator-78c6b8c79-h9nqc     1/1     Running   0          23h
agones-allocator-78c6b8c79-l2bzp     1/1     Running   0          23h
agones-allocator-78c6b8c79-rw75j     1/1     Running   0          23h
agones-controller-fbf944f4-vs9xx     1/1     Running   0          23h
agones-controller-fbf944f4-sjk3t     1/1     Running   0          23h
agones-extensions-5648fc7dcf-hm6lk   1/1     Running   0          23h
agones-extensions-5648fc7dcf-qbc6h   1/1     Running   0          23h
agones-ping-5b9647874-2rrl6          1/1     Running   0          27h
agones-ping-5b9647874-rksgg          1/1     Running   0          27h

The number of replicas for agones-extensions can be set using helm variable agones.extensions.replicas, but the default is 2.

We expect the aggregate memory consumption of the pods to be slightly higher than that of the previous singleton pod, but since the responsibilities are now split across the pods, the aggregate CPU consumption should be similar.

Feature Design

Please see HA Code Blind.

3 - Controlling Disruption

Game servers running on Code Blind may be disrupted by Kubernetes; learn how to control disruption of your game servers.

Disruption in Kubernetes

A Pod in Kubernetes may be disrupted for involuntary reasons, e.g. hardware failure, or voluntary reasons, such as when nodes are drained for upgrades.

By default, Code Blind assumes your game server should never be disrupted voluntarily and configures the Pod appropriately - but this isn’t always the ideal setting. Here we discuss how Code Blind allows you to control the two most significant sources of voluntary Pod evictions, node upgrades and Cluster Autoscaler, using the eviction API on the GameServer object.

Benefits of Allowing Voluntary Disruption

It’s not always easy to write your game server in a way that allows for disruption, but it can have major benefits:

  • Compaction of your cluster using Cluster Autoscaler can lead to considerable cost savings for your infrastructure.
  • Allowing automated node upgrades can save you management toil, and lowers the time it takes to patch security vulnerabilities.

Considerations

When discussing game server pod disruption, it’s important to keep two factors in mind:

  • TERM signal: Is your game server tolerant of graceful termination? If you wish to support voluntary disruption, your game server must handle the TERM signal (even if it runs to completion after receiving TERM).
  • Termination Grace Period: After receiving TERM, how long does your game server need to run? If you run to completion after receiving TERM, this is equivalent to the session length - if not, you can think of this as the cleanup time. In general, we bucket the grace period into “less than 10 minutes”, “10 minutes to an hour”, and “greater than an hour”. (See below if you are curious about grace period considerations.)

eviction API

The eviction API is specified as part of the GameServerSpec, like:

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "simple-game-server"
spec:
  eviction:
    safe: Always
  template:
    [...]

You can set eviction.safe based on your game server’s tolerance for disruption and session length, based on the following diagram:

Eviction Decision Diagram

In words:

  • Does the game server support TERM and terminate within ten minutes?
    • Yes to both: Set safe: Always, and set terminationGracePeriodSeconds to the session length or cleanup time.
    • No to either: Does the game server support TERM and terminate within an hour?
      • Yes to both: Set safe: OnUpgrade, and configure terminationGracePeriodSeconds to the session length or cleanup time.
      • No to either: Set safe: Never. If your game server does not terminate within an hour, see below.
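
For example, a session-based game server that handles TERM and always finishes within ten minutes could use the first option above. This is a sketch; adjust terminationGracePeriodSeconds to your own session length or cleanup time:

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "simple-game-server"
spec:
  eviction:
    safe: Always
  template:
    spec:
      terminationGracePeriodSeconds: 600   # session length or cleanup time, in seconds
      containers:
      - name: simple-game-server
        image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27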

What’s special about ten minutes and one hour?

  • Ten minutes: Cluster Autoscaler respects ten minutes of graceful termination on scale-down. On some cloud products, you can configure --max-graceful-termination-sec to change this, but it is not advised: Cluster Autoscaler is currently only capable of scaling down one node at a time, and larger graceful termination windows slow this down farther (see autoscaler#5079). If the ten minute limit does not apply to you, generally you should choose between safe: Always (for sessions less than an hour), or see below.

  • One hour: On many cloud products, PodDisruptionBudget can only block node upgrade evictions for a certain period of time - on GKE this is 1h. After that, the PDB is ignored, or the node upgrade fails with an error. Controlling Pod disruption for longer than one hour requires cluster configuration changes outside of Code Blind - see below.

Considerations for long sessions

Outside of Cluster Autoscaler, the main source of disruption for long sessions is node upgrade. On some cloud products, such as GKE Standard, node upgrades are entirely within your control. On others, such as GKE Autopilot, node upgrade is automatic. Typical node upgrades use an eviction based, rolling recreate strategy, and may not honor PodDisruptionBudget for longer than an hour. See Best Practices for information specific to your cloud product.

Implementation / Under the hood

Each option uses a slightly different permutation of the safe-to-evict Pod annotation (read by the Cluster Autoscaler) and the agones.dev/safe-to-evict label (which controls whether the Pod is covered by the blocking PodDisruptionBudget).

As a quick reference:

eviction.safe setting   safe-to-evict pod annotation   agones.dev/safe-to-evict label
Never (default)         false                          false (matches PDB)
OnUpgrade               false                          true (does not match PDB)
Always                  true                           true (does not match PDB)

Further Reading

4 - Limiting CPU & Memory

Kubernetes natively has inbuilt capabilities for requesting and limiting both CPU and Memory usage of running containers.

As a short description:

  • CPU Requests are soft guarantees that are enforced when there is CPU congestion; a container can burst above its request when spare CPU is available.
  • CPU Limits are hard limits on how much CPU time the particular container gets access to.

This is useful for game servers, not just as a mechanism to distribute compute resources evenly, but also as a way to advise the Kubernetes scheduler how many game server processes it can fit onto a given node in the cluster.

It’s worth reading the Managing Compute Resources for Containers Kubernetes documentation for more details on “requests” and “limits” to both CPU and Memory, and how to configure them.

GameServers

Since the GameServer specification provides a full PodSpecTemplate, we can take advantage of both resource limits and requests in our GameServer configurations.

For example, to set a CPU limit on our GameServer configuration of 250m/0.25 of a CPU, we could do so as follows:

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: "simple-game-server"
spec:
  ports:
  - name: default
    containerPort: 7654
  template:
    spec:
      containers:
      - name: simple-game-server
        image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27
        resources:
          limits:
            cpu: "250m" #this is our limit here

If you do not set a limit or request, the default is set by Kubernetes at a 100m CPU request.

SDK GameServer sidecar

You may also want to tweak the CPU request or limits on the SDK GameServer sidecar process that spins up alongside each game server container.

You can do this through the Helm configuration when installing Code Blind.

By default, this is set to a CPU request value of 30m, with no hard CPU limit. This ensures that the sidecar always has enough CPU to function, but it is configurable in case a lower or higher value is required on your clusters, or if you desire a hard limit.
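
For example, when installing with Helm, the sidecar resources can be adjusted with values along these lines (the agones.image.sdk.* parameter names follow the upstream agones chart; verify them against your installed chart version):

helm upgrade --install my-release agones/agones -n agones-system \
  --set agones.image.sdk.cpuRequest=30m \
  --set agones.image.sdk.cpuLimit=100m   # omit, or set to 0, for no hard limit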

5 - Out of Cluster Dev Server

Running and debugging a server binary locally while connected to a full Kubernetes stack.

This section builds upon the topics discussed in the local SDK Server, Local Game Server, and GameServer allocation documentation. Having a firm understanding of those concepts will be necessary for running an “out of cluster” local server.

Running an “out of cluster” dev server combines the best parts of local debugging and being a part of a cluster. A developer will be able to run a custom server binary on their local machine, even within an IDE with breakpoints. The server would also be allocatable within a cluster, allowing integration with the project’s full stack for handling game server lifetime.

For each run, the only manual steps required by the developer are to run the local SDK Server and to run their custom gameplay binary (each can easily be reused/restarted). All other state progression will be handled automatically by the custom gameplay server (calling the SDK API), the SDK Server (handling the SDK calls), the cluster GameServer controller (progressing specific states), and the cluster’s allocation system (whether through GameServerAllocation or via the Allocator Service) – just as it would when running in a Pod in a cluster!

Out of cluster development is a fantastic option during early prototyping, as it can (optionally) all be run on a single machine with tools such as Minikube.

The name “out of cluster” is to contrast InClusterConfig which is used in the internal golang kubeconfig API.

Prerequisite steps

To be able to run an “out of cluster” local game server, one needs to first complete a few prerequisite steps.

Cluster created

First, a cluster must have been created that the developer has access to through commands like kubectl. This cluster could be running on a provider or locally (e.g. on Minikube). See Create Kubernetes Cluster for more details on how to create a cluster, if not already done so.

Code Blind GameServer resource created

Out of cluster dev servers make use of local dev servers. Follow the instructions there to create a GameServer resource for use with a local game server. Note that metadata:annotations:agones.dev/dev-address should be updated to point to the local machine; see the port forwarding details below.
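
As a reference, such a dev GameServer resource might look like the following (a sketch based on the local game server instructions; the name, port, and IP address are placeholders to replace with your own values):

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  name: my-local-server
  annotations:
    agones.dev/dev-address: "192.168.1.10"   # replace with the local machine's IP
spec:
  ports:
  - name: default
    portPolicy: Static
    hostPort: 7654
    containerPort: 7654
  template:
    spec:
      containers:
      - name: simple-game-server
        image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27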

SDK Server available

An “out of cluster” dev server also requires the SDK Server to be run locally.

When a GameServer runs normally in a prod-like environment, the Code Blind cluster controller will handle initializing the containers which contain the SDK Server and the game server binary. The game server binary will be able to connect over gRPC to the SDK Server running in the sidecar container. When the game server binary makes SDK calls (e.g. SDK.Ready()), those get sent to the SDK Server via gRPC, and the SDK Server is able to modify the GameServer resource in the cluster. When the GameServer resource gets modified (either by the Code Blind cluster controller, by the Code Blind Allocation Service, or by the K8s API), the SDK Server monitors those changes and sends update events over gRPC to the SDK API, resulting in a callback in the game server binary logic.

The goal of an “out of cluster” dev server is to keep all this prod-like functionality, even in a debuggable context. To do so, the developer must run the SDK Server locally such that the (also local) game server binary can connect via gRPC. Instructions for downloading and running the SDK Server can be found here. However, instead of using --local or --file, the SDK Server will need to be run in “out of cluster” mode by providing a kubeconfig file to connect to the cluster. This section focuses on getting the SDK Server ready to run locally; more detail about running it can be found below.

Game server binary available

When running Code Blind normally, the game server binary is inside a prebuilt docker image which is loaded into a container in a GameServer’s pod. This can either be a custom, developer-created, docker image and contained binary or a sample image/binary from an external source. This document will use the sample simple-game-server, which follows suit from various other documentation pages (e.g. Quickstart: Create a Game Server).

The simple-game-server can be run from the docker image us-docker.pkg.dev/codeblind/examples/simple-server:0.27. The game server binary can either be run within a docker container or run locally, so long as all ports are published/forwarded – more on this below.

Alternatively, the simple-game-server can also be run from source code; see examples/simple-game-server/main.go. More details about running from source can be found here.

Disclaimer: Code Blind is run and tested with the version of Go specified by the GO_VERSION variable in the project’s build Dockerfile. Other versions are not supported, but may still work.

If a developer has their own game server logic, written in the language of their choice, that would be perfectly fine. A custom game server can be similarly run within a docker container, run directly on commandline, or run via an IDE/debugger.

Forwarded Ports

As the game server binary will be run on the developer’s machine and a requesting client will attempt to connect to the game server via the GameServer’s metadata:annotations:agones.dev/dev-address and spec:ports:hostPort fields, the developer needs to ensure that connection can take place.

If the game server binary and the arbitrary connecting client logic are both on the same network, then connecting should work without any extra steps. However, if the developer has a more complicated network configuration or if they are attempting to connect over the public internet, extra steps may be required.

Obviously, this document does not know what every developer’s specific network configuration is, how their custom game client(s) work, their development environment, and/or various other factors. The developer will need to figure out which steps are necessary for their specific configuration.

If attempting to connect via the internet, the developer needs to set the GameServer’s metadata:annotations:agones.dev/dev-address field to their public IP. This can be found by going to whatsmyip.org or whatismyip.com in a web browser.

The GameServer’s spec:ports:hostPort/spec:ports:containerPort should be set to whichever port the game server binary’s logic will bind to – the port used by simple-game-server is 7654 (by default). The local network’s router must also be configured to forward this port to the desired machine; allowing inbound external requests (from the internet) to be directed to the machine on the network that is running the game server.

If the SDK Server is run on the same machine as the game server binary, no extra steps are necessary for the two to connect. By default, the SDK API (in the game server binary) will attempt to gRPC connect to the SDK Server on localhost on the port 9357. If the SDK Server is run on another machine, or if the SDK Server is set to use different ports (e.g. via commandline arguments), the developer will need to also take appropriate steps to ensure that the game server can connect to the SDK Server. As discussed further below, running the SDK Server with --address 0.0.0.0 can be quite helpful with various setups.

If the developer is running the SDK Server or the game server binary within docker container(s), then publishing ports and/or connecting to a docker network may be necessary. Again, these configurations can vary quite dramatically and the developer will need to find the necessary steps for their specific setup.

Running “out of cluster” local game server

Now that all prerequisite steps have been completed, the developer should have:

  • a Kubernetes cluster they can access with kubectl,
  • a dev GameServer resource created on that cluster (with agones.dev/dev-address set),
  • the SDK Server binary available to run locally, and
  • a game server binary (or docker image) available to run locally.

Optional GameServer state monitoring

A helpful (optional) step to see progress when running is to watch the GameServer resource.

This can be done with the command:

kubectl get --watch -n default gs my-local-server

It may be necessary to replace default and my-local-server with whichever namespace/name values are used by the dev GameServer created above.

With this command running, the terminal will automatically show updates to the GameServer’s state – however, this is not necessary to proceed.

Running SDK Server locally

The first step is to run the SDK Server, making it available for the (later run) game server binary to connect. Here is a sample command to run the SDK Server, with each argument discussed after.

./sdk-server.linux.amd64 \
  --gameserver-name my-local-server \
  --pod-namespace default \
  --kubeconfig "$HOME/.kube/config" \
  --address 0.0.0.0 \
  --graceful-termination false

  • --gameserver-name is a necessary arg, passed instead of the GAMESERVER_NAME environment variable.
    • It is set to the name of the dev GameServer k8s resource.
    • It tells the SDK Server which resource to read/write to on the k8s cluster.
    • This example value of my-local-server matches the instructions for setting up a Local Game Server.
  • --pod-namespace is a necessary arg, passed instead of the POD_NAMESPACE environment variable.
    • It is set to the namespace which the dev GameServer resides in.
    • It tells the SDK Server which namespace to look under for the GameServer to read/write to on the k8s cluster.
    • This example value of default is used as most instructions in this documentation assume GameServers are created in the default namespace.
  • --kubeconfig tells the SDK Server how to connect to the k8s cluster.
    • This actually does not trigger any special flow (unlike --local or --file). The SDK Server will run just as it would when created in a sidecar container in a k8s cluster.
    • Passing this argument simply provides where to connect along with the credentials to do so.
    • This example value of "$HOME/.kube/config" is the default location for k8s authentication information. This requires the developer be logged in via kubectl and have the desired cluster selected via kubectl config use-context.
  • --address specifies the binding IP address for the SDK Server’s SDK API.
    • By default, the binding address is localhost. This may be difficult for some development setups.
    • Overriding this value changes which IP address(es) the server will bind to for receiving gRPC/REST SDK API calls.
    • This example value of 0.0.0.0 sets the SDK Server to receive API calls that are sent to any IP address (that reach the machine).
  • --graceful-termination set to false will disable some smooth state transitions when exiting.
    • By default, the SDK Server will wait until the GameServer has reached the Shutdown state before exiting (“graceful termination”).
    • With graceful termination enabled, the SDK Server would hang (waiting on a state update) when attempting to terminate it (e.g. with ^C).
    • When running binaries in a development context, quickly exiting and restarting the SDK Server is handy.

This can easily be terminated with ^C and restarted as necessary. Note that terminating the SDK Server while the game server binary (discussed in the next section) is using it may result in failure to update/watch GameServer state and may result in a runtime error in the game server binary.

Running game server binary locally

Now that the SDK Server is running locally with k8s credentials, the game server binary can run in an integrated fashion. The game server binary’s SDK calls will reach the local SDK Server, which will then interact with the GameServer resource on the k8s cluster.

Again, this document will make use of simple-game-server via its docker image, but running directly or use of a custom game server binary is just as applicable. Run the game server binary with the command:

docker run --rm --network="host" us-docker.pkg.dev/codeblind/examples/simple-server:0.27

The --rm flag will nicely autoclean up the docker container after exiting. The --network="host" flag will tell the docker container to use the host’s network stack directly; this allows calls to localhost to reach the SDK Server. The commands and flags used will likely differ if running a custom game server binary.

If the earlier kubectl get --watch command was run, it will now show the GameServer progressed to the RequestReady state, which will automatically be progressed to the Ready state by the Code Blind controller on the cluster.

The GameServer state can further be modified by SDK calls, gRPC/REST calls, allocation via either GameServerAllocation or Allocator Service, K8s API calls, etc. These changes will be shown by the kubectl get --watch command. These changes will also be picked up by the game server binary, if there is a listener registered through the SDK API. This means that this GameServer can be allocated just as it would be when running completely on k8s, but it can be locally debugged.

If the server crashes or is killed by the developer, it can easily be restarted. This can be done without restarting the SDK Server or any other manual intervention with the GameServer resource. Naturally, this may have implications on any connected clients, but that is project specific and left to the developer to handle.

6 - Allocator Service

Code Blind provides an mTLS based allocator service that is accessible from outside the cluster using a load balancer. The service is deployed and scaled independently of the Code Blind controller.

To allocate a game server, Code Blind provides a gRPC and REST service with mTLS authentication, called agones-allocator that can be used instead of GameServerAllocations.

Both gRPC and REST are accessible through a Kubernetes service that can be externalized using a load balancer. By default, gRPC and REST are served from the same port. However, either service can be disabled or the services can be served from separate ports using the helm configuration.
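
For example, with a Helm installation, the ports and protocols can be adjusted with values along these lines (an illustrative sketch; the agones.allocator.service.* parameter names should be verified against your chart version):

helm upgrade my-release agones/agones -n agones-system --reuse-values \
  --set agones.allocator.service.grpc.enabled=true \
  --set agones.allocator.service.grpc.port=443 \
  --set agones.allocator.service.http.enabled=false   # disable the REST endpoint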

For requests to either service to succeed, a client certificate must be provided that is in the authorization list of the allocator service. The remainder of this article describes how to manually make a successful allocation request using the API.

The guide assumes you have command line tools installed for jq, go and openssl.

GameServerAllocation vs Allocator Service

There are several reasons you may prefer to use the Allocator Service over the GameServerAllocation custom resource definition, depending on your architecture and requirements:

  • A requirement to do multi-cluster allocation.
  • Want to create Allocations from outside the Code Blind Kubernetes cluster.
  • Prefer SSL based authentication over Kubernetes RBAC.
  • Prefer a gRPC or REST based API over an integration with the Kubernetes API.

Find the external IP

The service is hosted under the same namespace as the Code Blind controller. To find the external IP of your allocator service, replace agones-system namespace with the namespace to which Code Blind is deployed and execute the following command:

kubectl get service agones-allocator -n agones-system

The output of the command should look like:

NAME                        TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)            AGE
agones-allocator            LoadBalancer   10.55.251.73    34.82.195.204   443:30250/TCP      7d22h

Server TLS certificate

If the agones-allocator service is installed as a LoadBalancer using a reserved IP, a valid self-signed server TLS certificate is generated using the IP provided. Otherwise, the server TLS certificate should be replaced. If you installed Code Blind using helm, you can easily reconfigure the allocator service with a preset IP address by setting the agones.allocator.service.loadBalancerIP parameter to the address that was automatically assigned to the service and helm upgrade:

EXTERNAL_IP=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
helm upgrade my-release agones/agones -n agones-system --wait \
   --set agones.allocator.service.loadBalancerIP=${EXTERNAL_IP} \
   ...

Another approach is to replace the default server TLS certificate with a certificate with CN and subjectAltName. There are multiple approaches to generate a certificate. Code Blind recommends using the cert-manager.io solution for cluster-level certificate management.

In order to use the cert-manager solution, first install cert-manager on the cluster. Then, configure an Issuer/ClusterIssuer resource, and finally configure a Certificate resource to manage the allocator-tls Secret. Make sure to configure the Certificate based on your system’s requirements, including the validity duration.

Here is an example of using a self-signed ClusterIssuer for configuring allocator-tls Secret:

#!/bin/bash
# Create a self-signed ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned
spec:
  selfSigned: {}
EOF

EXTERNAL_IP=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# for EKS use hostname
# HOST_NAME=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Create a Certificate with IP for the allocator-tls secret
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: allocator-tls
  namespace: agones-system
spec:
  commonName: ${EXTERNAL_IP}
  ipAddresses:
    - ${EXTERNAL_IP}
  secretName: allocator-tls
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
EOF

# Wait for the allocator-tls Secret
sleep 1
TLS_CA_VALUE=$(kubectl get secret allocator-tls -n agones-system -ojsonpath='{.data.ca\.crt}')

# Add ca.crt to the allocator-tls-ca Secret
kubectl get secret allocator-tls-ca -o json -n agones-system | jq '.data["tls-ca.crt"]="'${TLS_CA_VALUE}'"' | kubectl apply -f -
echo $TLS_CA_VALUE | base64 -d > ca.crt
# In case of MacOS
# echo $TLS_CA_VALUE | base64 -D > ca.crt

Bring Your Own Certificates (advanced)

If you would like to completely manage the TLS secrets outside of helm, you can create them in the namespace where agones is going to be installed, and then set the helm value agones.allocator.disableSecretCreation to true. This method also works with the cert-manager approach, as long as your certificate and secret are created ahead of time, and you populate the allocator-tls-ca and allocator-client-ca secrets yourself.
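
For example, using the agones.allocator.disableSecretCreation value mentioned above:

helm install my-release agones/agones -n agones-system \
   --set agones.allocator.disableSecretCreation=true \
   ...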

Client Certificate

Because agones-allocator uses an mTLS authentication mechanism, a client must provide a certificate that is accepted by the server.

If Code Blind is installed using Helm, you can leverage a default client secret, allocator-client.default, created in the game server namespace and allowlisted in allocator-client-ca Kubernetes secret. You can extract and use that secret for client side authentication, by following the allocation example.

Otherwise, here is an example of generating a client certificate using openssl.

#!/bin/bash

EXTERNAL_IP=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout client.key -out client.crt -addext 'subjectAltName=IP:'${EXTERNAL_IP}''

CERT_FILE_VALUE=$(cat client.crt | base64 -w 0)

# In case of MacOS
# CERT_FILE_VALUE=$(cat client.crt | base64)

# allowlist client certificate
kubectl get secret allocator-client-ca -o json -n agones-system | jq '.data["client_trial.crt"]="'${CERT_FILE_VALUE}'"' | kubectl apply -f -

The last command creates a new entry in the secret data map for allocator-client-ca for the client CA. This is for the agones-allocator service to accept the newly generated client certificate.

Send allocation request

After setting up agones-allocator with server certificate and allowlisting the client certificate, the service can be used to allocate game servers. Make sure you have a fleet with ready game servers in the game server namespace.

Set the environment variables and store the client secrets before allocating using gRPC or REST APIs:

NAMESPACE=default # replace with any namespace
EXTERNAL_IP=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
KEY_FILE=client.key
CERT_FILE=client.crt
TLS_CA_FILE=ca.crt

# allocator-client.default secret is created only when using helm installation. Otherwise generate the client certificate and replace the following.
# In case of MacOS replace "base64 -d" with "base64 -D"
kubectl get secret allocator-client.default -n "${NAMESPACE}" -ojsonpath="{.data.tls\.crt}" | base64 -d > "${CERT_FILE}"
kubectl get secret allocator-client.default -n "${NAMESPACE}" -ojsonpath="{.data.tls\.key}" | base64 -d > "${KEY_FILE}"
kubectl get secret allocator-tls-ca -n agones-system -ojsonpath="{.data.tls-ca\.crt}" | base64 -d > "${TLS_CA_FILE}"

Using gRPC

To start, take a look at the allocation gRPC client examples in golang and C# languages. In the following, the golang gRPC client example is used to allocate a Game Server in the default namespace.

#!/bin/bash

go run examples/allocator-client/main.go --ip ${EXTERNAL_IP} \
    --port 443 \
    --namespace ${NAMESPACE} \
    --key ${KEY_FILE} \
    --cert ${CERT_FILE} \
    --cacert ${TLS_CA_FILE}

Using REST

#!/bin/bash

curl --key ${KEY_FILE} \
     --cert ${CERT_FILE} \
     --cacert ${TLS_CA_FILE} \
     -H "Content-Type: application/json" \
     --data '{"namespace":"'${NAMESPACE}'"}' \
     https://${EXTERNAL_IP}/gameserverallocation \
     -X POST

You should expect to see the following output:

{"gameServerName":"game-server-name","ports":[{"name":"default","port":7463}],"address":"1.2.3.4","nodeName":"node-name"}

Sending Data to the Game Server

The service accepts a metadata field, which can be used to apply labels and annotations to the allocated GameServer. The older metaPatch field is now deprecated, but can still be used for compatibility. If both metadata and metaPatch fields are set, metaPatch is ignored.
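
For example, the REST request from above could attach a label to the allocated GameServer like this (the mode label and its value are purely illustrative):

curl --key ${KEY_FILE} \
     --cert ${CERT_FILE} \
     --cacert ${TLS_CA_FILE} \
     -H "Content-Type: application/json" \
     --data '{"namespace":"'${NAMESPACE}'","metadata":{"labels":{"mode":"deathmatch"}}}' \
     https://${EXTERNAL_IP}/gameserverallocation \
     -X POST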

Secrets Explained

agones-allocator has a dependency on three Kubernetes secrets:

  1. allocator-tls - stores the server certificate.
  2. allocator-client-ca - stores the allocation authorized client CA for mTLS to allowlist client certificates.
  3. allocator-tls-ca (optional) - stores allocator-tls CA.

The CA secret is kept separate from the private secret for security reasons: the allocation client can retrieve the allocator CA (used to validate the server) without being able to read the private key material. It is optional to set or maintain the allocator-tls-ca secret.

Troubleshooting

If you encounter problems, explore the following potential root causes:

  1. Check server certificate - Using openssl you can get the certificate chain for the server.

    EXTERNAL_IP=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    openssl s_client -connect ${EXTERNAL_IP}:443
    
    • Inspect the server certificate by storing the certificate returned, under Server certificate and validating using openssl x509 -in tls.crt -text -noout.
    • Make sure the certificate is not expired and the Subject Alternative Name is set.
    • If the issuer is CN = allocation-ca, the certificate is generated using Code Blind helm installation.
  2. Check client certificate

    • If you get an error such as rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection closed, make sure your client certificate is allowlisted by being added to allocator-client-ca.
    kubectl get secret allocator-client-ca -o json -n agones-system
    
    • If the server certificate is not accepted by the client, you may get an error such as rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority", depending on the client. In this case, verify that the TLS CA file matches the server certificate.
    kubectl get secret allocator-tls -n agones-system -ojsonpath="{.data.tls\.crt}" | base64 -d > tls.crt
    openssl verify -verbose -CAfile ca.crt tls.crt
    tls.crt: OK
    
  3. Make sure the service is up and running.

    kubectl get pod -n agones-system | grep agones-allocator
    agones-allocator-59b4f6b5c6-86j62      1/1     Running     0          6m36s
    agones-allocator-59b4f6b5c6-kbqrq      1/1     Running     0          6m45s
    agones-allocator-59b4f6b5c6-trbkl      1/1     Running     0          6m28s
    
    kubectl get service agones-allocator -n agones-system
    agones-allocator   LoadBalancer   10.55.248.14   34.82.195.204    443:32468/TCP   6d23h
    

API Reference

The AllocationService API is located as a gRPC service here. Additionally, the REST API is available as a Swagger API.

7 - Multi-cluster Allocation

In order to allow allocation from multiple clusters, Code Blind provides a mechanism to set redirect rules for allocation requests to the right cluster.

There may be different types of clusters, such as on-premises and Google Kubernetes Engine (GKE), used by a game to help with cost saving and availability. For this purpose, Code Blind provides a mechanism to define priorities on the clusters. Priorities are defined in the GameServerAllocationPolicy CRD. A matchmaker can enable the multi-cluster rules on a request, target the agones-allocator endpoint in any of the clusters, and get resources allocated on the cluster with the highest priority. If the cluster with the highest priority is overloaded, the allocation request is redirected to the cluster with the next highest priority.

The remainder of this article describes how to enable multi-cluster allocation.

Define Cluster Priority

GameServerAllocationPolicy is the CRD defined by Code Blind for setting multi-cluster allocation rules. In addition to cluster priority, it describes the connection information for the target cluster, including the game server namespace, agones-allocator endpoint, and client K8s secret name for redirecting the allocation request. Game servers will be allocated from clusters with the lowest priority number. If there are no game servers available in clusters with the lowest priority number, they will be allocated from clusters with the next lowest priority number. For clusters with the same priority, the cluster is chosen with a probability relative to its weight.

Here is an example of setting the priority for a cluster and its connection rules. One such resource should be defined per cluster.

In the following example the policy is defined for cluster B in cluster A.

cat <<EOF | kubectl apply -f -
apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: allocator-cluster-b
  namespace: cluster-a-ns
spec:
  connectionInfo:
    allocationEndpoints:
    - 34.82.195.204
    clusterName: "clusterB"
    namespace: cluster-b-ns
    secretName: allocator-client-to-cluster-b
    serverCa: c2VydmVyQ0E=
  priority: 1
  weight: 100
EOF

To define the local cluster priority a GameServerAllocationPolicy should be defined without an allocationEndpoints field. If the local cluster priority is not defined, the allocation from the local cluster happens only if allocation from other clusters with the existing allocation rules is unsuccessful.

Allocation requests with multi-cluster allocation enabled but with only the local cluster available (e.g. in development) must have a local cluster priority defined, or the request fails with the error “no multi-cluster allocation policy is specified”.
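
As a sketch, a local cluster policy for cluster A could look like the following (the policy name, clusterName, priority and weight are illustrative; note the absence of an allocationEndpoints field):

cat <<EOF | kubectl apply -f -
apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: allocator-local-cluster
  namespace: cluster-a-ns
spec:
  connectionInfo:
    clusterName: "clusterA"
    namespace: cluster-a-ns
  priority: 1
  weight: 100
EOF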

The namespace field in connectionInfo is the namespace that the game servers will be allocated in, and must be a namespace in the target cluster that has been previously defined as allowed to host game servers. The Namespace specified in the allocation request (below) is used to refer to the namespace that the GameServerAllocationPolicy itself is located in.

serverCa is the server TLS CA public certificate, set only if the remote server certificate is not signed by a public CA (e.g. self-signed). If this field is not specified, the certificate can also be specified in a field named ca.crt of the client secret (the secret referred to in the secretName field).

Establish trust

To accept allocation requests from other clusters, agones-allocator for cluster B should be configured to accept the client’s certificate from cluster A and the cluster A’s client should be configured to accept the server TLS certificate, if it is not signed by a public Certificate Authority (CA).

Follow the steps to configure the agones allocator gRPC service. The client certificate pair in the mentioned document is stored as a K8s secret. Here are the secrets to set:

1. Client certificate to talk to other clusters:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: allocator-client-to-cluster-b
  namespace: cluster-a-ns
type: Opaque
data:
  tls.crt: <REDACTED>
  tls.key: <REDACTED>
  ca.crt: <REDACTED>
EOF

Each certificate value is the base64-encoded content of the corresponding file, e.g. cat ${CERT_FILE} | base64 -w 0

Code Blind recommends using the cert-manager.io solution for generating client certificates.

2. Add the client CA to the list of client certificates authorized by agones-allocator in the targeted cluster.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: allocator-client-ca
  namespace: agones-system
type: Opaque
data:
  client1.crt: <REDACTED>
  client2.crt: <REDACTED>
  clientN.crt: <REDACTED>
EOF

Allocate multi-cluster

To enable multi-cluster allocation, set multiClusterSetting.enabled to true in allocation.proto and send allocation requests. For more information visit agones-allocator. In the following example, the allocator-client sample is used to send a multi-cluster allocation request to the agones-allocator service. If the allocation succeeds, the AllocationResponse will contain a Source field which indicates the endpoint of the remote agones-allocator.

Set the environment variables and store the client secrets before allocating using gRPC or REST APIs:

#!/bin/bash

NAMESPACE=default # replace with any namespace
EXTERNAL_IP=$(kubectl get services agones-allocator -n agones-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
KEY_FILE=client.key
CERT_FILE=client.crt
TLS_CA_FILE=ca.crt

# allocator-client.default secret is created only when using helm installation. Otherwise generate the client certificate and replace the following.
# In case of MacOS replace "base64 -d" with "base64 -D"
kubectl get secret allocator-client.default -n "${NAMESPACE}" -ojsonpath="{.data.tls\.crt}" | base64 -d > "${CERT_FILE}"
kubectl get secret allocator-client.default -n "${NAMESPACE}" -ojsonpath="{.data.tls\.key}" | base64 -d > "${KEY_FILE}"
kubectl get secret allocator-tls-ca -n agones-system -ojsonpath="{.data.tls-ca\.crt}" | base64 -d > "${TLS_CA_FILE}"
#!/bin/bash

go run examples/allocator-client/main.go --ip ${EXTERNAL_IP} \
    --namespace ${NAMESPACE} \
    --key ${KEY_FILE} \
    --cert ${CERT_FILE} \
    --cacert ${TLS_CA_FILE} \
    --multicluster true

If using REST:

#!/bin/bash

curl --key ${KEY_FILE} \
     --cert ${CERT_FILE} \
     --cacert ${TLS_CA_FILE} \
     -H "Content-Type: application/json" \
     --data '{"namespace":"'${NAMESPACE}'", "multiClusterSetting":{"enabled":true}}' \
     https://${EXTERNAL_IP}/gameserverallocation \
     -X POST
     

Troubleshooting

If you encounter problems, explore the following potential root causes:

  1. Make sure single cluster allocation works for each cluster using the allocator service troubleshooting steps.

  2. For each cluster, make sure there is a GameServerAllocationPolicy resource defined in the game server cluster.

  3. Inspect the .spec.connectionInfo of the GameServerAllocationPolicy for each cluster, and use the connection information in that field to verify that single cluster allocation works against that cluster:

POLICY_NAME=<policy-name>
POLICY_NAMESPACE=<policy-namespace>

NAMESPACE=$(kubectl get gameserverallocationpolicy ${POLICY_NAME} -n ${POLICY_NAMESPACE} -ojsonpath={.spec.connectionInfo.namespace})
EXTERNAL_IP=$(kubectl get gameserverallocationpolicy ${POLICY_NAME} -n ${POLICY_NAMESPACE} -ojsonpath={.spec.connectionInfo.allocationEndpoints\[0\]})
CLIENT_SECRET_NAME=$(kubectl get gameserverallocationpolicy ${POLICY_NAME} -n ${POLICY_NAMESPACE} -ojsonpath={.spec.connectionInfo.secretName})

KEY_FILE=client.key
CERT_FILE=client.crt
TLS_CA_FILE=ca.crt

# In case of MacOS replace "base64 -d" with "base64 -D"
kubectl get secret "${CLIENT_SECRET_NAME}" -n "${POLICY_NAMESPACE}" -ojsonpath="{.data.tls\.crt}" | base64 -d > "${CERT_FILE}"
kubectl get secret "${CLIENT_SECRET_NAME}" -n "${POLICY_NAMESPACE}" -ojsonpath="{.data.tls\.key}" | base64 -d > "${KEY_FILE}"
kubectl get secret "${CLIENT_SECRET_NAME}" -n "${POLICY_NAMESPACE}" -ojsonpath="{.data.ca\.crt}" | base64 -d > "${TLS_CA_FILE}"
#!/bin/bash

go run examples/allocator-client/main.go --ip ${EXTERNAL_IP} \
    --port 443 \
    --namespace ${NAMESPACE} \
    --key ${KEY_FILE} \
    --cert ${CERT_FILE} \
    --cacert ${TLS_CA_FILE}

8 - GameServer Pod Service Accounts

RBAC permissions and service accounts for the GameServer Pod.

Default Settings

By default, Code Blind sets up service accounts and sets them appropriately for the Pods that are created for GameServers.

Since Code Blind provides GameServer Pods with a sidecar container that needs access to Code Blind Custom Resource Definitions, Pods are configured with a service account with extra RBAC permissions to ensure that it can read and modify the resources it needs.

Since service accounts apply to all containers in a Pod, Code Blind will automatically overwrite the mounted key for the service account in the container that is running the dedicated game server in the backing Pod. This is done since game server containers are exposed publicly, and generally don’t require the extra permissions to access aspects of the Kubernetes API.

Bringing your own Service Account

If needed, you can provide your own service account on the Pod specification in the GameServer configuration.

For example:

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  generateName: "simple-game-server-"
spec:
  ports:
  - name: default
    containerPort: 7654
  template:
    spec:
      serviceAccountName: my-special-service-account # a custom service account
      containers:
      - name: simple-game-server
        image: us-docker.pkg.dev/codeblind/examples/simple-server:0.27

If a service account is configured, the mounted key is not overwritten, as it is assumed that you want to have full control of the service account and underlying RBAC permissions.