Best Practices

Best practices for running Code Blind in production.

Overview

Running Code Blind in production requires careful planning, from preparing your launch to deciding the best course of action for cluster and Code Blind upgrades. On this page, we’ve collected some general best practices. We also have cloud specific pages for:

If you are interested in submitting best practices for your cloud provider / on-prem environment, please contribute!

Separation of Code Blind from GameServer nodes

When running in production, Code Blind should be scheduled on a dedicated pool of nodes, distinct from the nodes where Game Servers are scheduled, for better isolation and resiliency. By default, Code Blind prefers to be scheduled on nodes labeled with agones.dev/agones-system=true and tolerates the node taint agones.dev/agones-system=true:NoExecute. If no dedicated nodes are available, Code Blind will run on regular nodes. See taints and tolerations for more information about Kubernetes taints and tolerations.
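
As a minimal sketch, the label and taint can be applied with kubectl; the node name is a placeholder, and in practice you would typically apply them to an entire dedicated node pool through your cloud provider’s tooling:

# [NODE_NAME] is a placeholder for a node in your dedicated Code Blind node pool
kubectl label nodes [NODE_NAME] agones.dev/agones-system=true
kubectl taint nodes [NODE_NAME] agones.dev/agones-system=true:NoExecute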

If you are collecting Metrics using our standard Prometheus installation, see the installation guide for instructions on configuring a separate node pool for the agones.dev/agones-metrics=true taint.
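
As a similar sketch (again with a placeholder node name), the metrics node pool uses its own label and taint:

kubectl label nodes [NODE_NAME] agones.dev/agones-metrics=true
kubectl taint nodes [NODE_NAME] agones.dev/agones-metrics=true:NoExecute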

See Creating a Cluster for initial set up on your cloud provider.

Redundant Clusters

Allocate Across Clusters

Code Blind supports Multi-cluster Allocation, allowing you to allocate from a set of clusters rather than relying on a single cluster as a potential point of failure. There are several other options for multi-cluster allocation:
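
As an illustrative sketch, a GameServerAllocationPolicy resource in the routing cluster describes how allocations can be forwarded to another cluster; the names, endpoint, and CA value below are placeholders, and the full setup is covered in the Multi-cluster Allocation guide:

cat <<EOF | kubectl apply -f -
apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: allocate-to-cluster-b
  namespace: default
spec:
  priority: 1
  weight: 100
  connectionInfo:
    allocationEndpoints:
    - 203.0.113.10                            # placeholder allocator endpoint
    clusterName: cluster-b
    namespace: default
    secretName: allocator-client-to-cluster-b # placeholder client secret
    serverCa: c2VydmVyQ0E=                    # placeholder base64-encoded CA
EOF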

Spread

You should consider spreading your game servers in two ways:

  • Across geographic fault domains (GCP regions, AWS availability zones, separate datacenters, etc.): This is desirable for geographic fault isolation, but also for optimizing client latency to the game server.
  • Within a fault domain: Kubernetes Clusters are single points of failure. A single misconfigured RBAC rule, an overloaded Kubernetes Control Plane, etc. can prevent new game server allocations, or worse, disrupt existing sessions. Running multiple clusters within a fault domain also allows for easier upgrades.

1 - Google Kubernetes Engine Best Practices

Best practices for running Code Blind on Google Kubernetes Engine (GKE).

Overview

On this page, we’ve collected several Google Kubernetes Engine (GKE) best practices.

Release Channels

Why?

We recommend using Release Channels for all GKE clusters. Using Release Channels has several advantages:

  • Google automatically manages the version and upgrade cadence for your Kubernetes Control Plane and its nodes.
  • Clusters on a Release Channel are allowed to use the No minor upgrades and No minor or node upgrades scope of maintenance exclusions - in other words, enrolling a cluster in a Release Channel gives you more control over node upgrades.
  • Clusters enrolled in rapid channel have access to the newest Kubernetes version first. Code Blind strives to support the newest release in rapid channel to allow you to test the newest Kubernetes soon after it’s available in GKE.

What channel should I use?

We recommend the regular channel, which offers a balance between stability and freshness. See this guide for more discussion.
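
For example, here is a sketch of enrolling a new cluster in the regular channel at creation time (cluster name and region are placeholders; an existing cluster can be enrolled with gcloud container clusters update):

gcloud container clusters create [CLUSTER_NAME] \
  --region=[COMPUTE_REGION] \
  --release-channel=regular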

If you need to disallow minor version upgrades for more than 6 months, consider choosing the freshest Kubernetes version possible: choosing the freshest version on rapid or regular extends the amount of time before your cluster reaches end of life.

What versions are available on a given channel?

You can query the versions available across different channels using gcloud:

gcloud container get-server-config \
  --region=[COMPUTE_REGION] \
  --flatten="channels" \
  --format="yaml(channels)"

Replace the following:

  • COMPUTE_REGION: the Compute Engine region of your cluster.

Managing Game Server Disruption on GKE

If your game session length is less than an hour, use the eviction API to configure your game servers appropriately - see Controlling Disruption.
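
As a minimal sketch of what that looks like on a Fleet (the Fleet name, port, and container image are placeholders; see Controlling Disruption for which eviction.safe setting fits your session length):

cat <<EOF | kubectl apply -f -
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: example-fleet
spec:
  replicas: 2
  template:
    spec:
      eviction:
        safe: Always          # or OnUpgrade / Never, per Controlling Disruption
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: game-server
            image: example.com/your-game-server:latest   # placeholder image
EOF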

For sessions longer than an hour, there are currently two possible approaches to manage disruption:

  • (GKE Standard/Autopilot) Blue/green deployment at the cluster level: If you are using an automated deployment process, you can:

    • create a new, green cluster within a release channel (e.g. every week),
    • use maintenance exclusions to prevent node upgrades for 30 days (see the sketch after this list),
    • scale the Fleet on the old, blue cluster down to 0,
    • use multi-cluster allocation on Code Blind, which will then direct new allocations to the new, green cluster (since blue has 0 desired), and then
    • delete the old, blue cluster when the Fleet successfully scales down.
  • (GKE Standard only) Use node pool blue/green upgrades
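
As a sketch of the maintenance exclusion step above (the cluster name, region, exclusion name, and dates are placeholders):

gcloud container clusters update [CLUSTER_NAME] \
  --region=[COMPUTE_REGION] \
  --add-maintenance-exclusion-name=no-node-upgrades \
  --add-maintenance-exclusion-start=2024-01-01T00:00:00Z \
  --add-maintenance-exclusion-end=2024-01-31T00:00:00Z \
  --add-maintenance-exclusion-scope=no_minor_or_node_upgrades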