<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[timberry.dev]]></title><description><![CDATA[timberry.dev]]></description><link>https://timberry.dev</link><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 23:46:23 GMT</lastBuildDate><atom:link href="https://timberry.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What are Argo Rollouts?]]></title><description><![CDATA[Next up in my quest to Learn All The Things™ on the CNCF graduated projects page, we’re going to take a look at a lesser known Argo project: Argo Rollouts. In a nutshell, Argo Rollouts are a drop-in replacement for the Deployment object that provide ...]]></description><link>https://timberry.dev/what-are-argo-rollouts</link><guid isPermaLink="true">https://timberry.dev/what-are-argo-rollouts</guid><category><![CDATA[argo]]></category><category><![CDATA[argo rollout]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[CNCF]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Fri, 15 Aug 2025 09:00:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754988537703/c51c4889-9a9d-479d-92cf-272925694502.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Next up in my quest to Learn All The Things™ on the CNCF <a target="_blank" href="https://www.cncf.io/projects/">graduated projects</a> page, we’re going to take a look at a lesser known Argo project: Argo Rollouts. In a nutshell, <a target="_blank" href="https://argoproj.github.io/rollouts/">Argo Rollouts</a> are a drop-in replacement for the <code>Deployment</code> object that provide much more automation of common progressive deployment patterns such as Canary and Blue-Green. Rollouts can also optionally integrate with ingress controllers and service meshes, and can even query and interpret metrics via APIs to drive their autonomous behaviour.</p>
<h2 id="heading-deployment-strategies">Deployment Strategies</h2>
<p>The Kubernetes <a target="_blank" href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a> object is probably the resource we use most often, at least until we start building more advanced clusters that leverage service meshes. It’s a rock-solid native Kubernetes object that helps us declare what a set of <code>Pods</code> running a workload should look like. It provides the famous Kubernetes control loop that keeps the <code>Pods</code> we’ve declared running, and it also offers some basic functionality for safely updating a workload.</p>
<p>Recall that a <code>Deployment</code>, under the hood, is managing <code>ReplicaSets</code>, which you can think of as versions of our workload configuration. When we make a change to that configuration, a new <code>ReplicaSet</code> is created, and a previous <code>ReplicaSet</code> is ultimately deprecated. <code>Deployments</code> attempt to do this safely using one of two methods:</p>
<ul>
<li><p><code>RollingUpdate</code> (the default): The new <code>ReplicaSet</code> is gradually scaled up to the desired number of <code>Pods</code>, as the old <code>ReplicaSet</code> is gradually scaled down. We can also influence this type of update by limiting how many <code>Pods</code> we’ll tolerate as unavailable (with <code>maxUnavailable</code>) and how many additional <code>Pods</code> we will allow during the update (with <code>maxSurge</code>); see the sketch after this list.</p>
</li>
<li><p><code>Recreate</code>: With this strategy, the entire existing <code>ReplicaSet</code> is scaled down and terminated before the new one is created. This is sometimes helpful if you want a clean cut off of traffic between different versions of your workload.</p>
</li>
</ul>
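<p>As a reference point, here’s a minimal sketch of how those fields sit in a <code>Deployment</code> spec (the name, image and values here are purely illustrative):</p>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # tolerate at most 1 Pod below the desired count
      maxSurge: 2         # allow at most 2 extra Pods during the update
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.27
</code></pre>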
<p>Often when we’re first trying Kubernetes, we learn how to implement versions of the canary and blue-green patterns by combining multiple <code>Deployments</code> with a <code>Service</code> object.</p>
<p>For example, we can run a blue <code>Deployment</code> and a green <code>Deployment</code>, and switch between them easily with a <code>Service</code> selector. Or we can run a canary <code>Deployment</code> with a smaller number of <code>Pods</code>, and let the <code>Service</code> object select this along with a larger production <code>Deployment</code>. But these techniques don’t scale well, and they require constant manual intervention to manage. This is where Argo’s automation can help.</p>
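<p>As a sketch of that manual blue-green approach, assuming two <code>Deployments</code> whose <code>Pods</code> are labelled <code>version: blue</code> and <code>version: green</code>, switching traffic is just a matter of editing one selector label:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue   # change to "green" to cut traffic over
  ports:
  - port: 80
    targetPort: 8080
</code></pre>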
<h2 id="heading-the-rollout-object">The Rollout Object</h2>
<p>Argo provides us with a new custom resource definition (CRD): the <code>Rollout</code>.</p>
<p>Essentially this object combines everything we can declare in a <code>Deployment</code> object with a much more advanced strategy definition. Within the strategy we can now describe the steps required to successfully roll out updates using the canary or blue-green patterns, including traffic splitting and approval steps. Let’s walk through a basic example to see how this works!</p>
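<p>The example below uses the canary pattern, but for reference, a blue-green strategy stanza looks roughly like this (a minimal sketch; the <code>Service</code> names are illustrative):</p>
<pre><code class="lang-yaml">strategy:
  blueGreen:
    activeService: rollouts-demo-active    # Service receiving live traffic
    previewService: rollouts-demo-preview  # Service pointing at the new version
    autoPromotionEnabled: false            # wait for a manual promote step
</code></pre>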
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>To follow along, you’ll need access to a Kubernetes cluster. I’m normally a fan of <a target="_blank" href="https://kind.sigs.k8s.io/">Kind</a>, or even <a target="_blank" href="https://minikube.sigs.k8s.io/docs/"><strong>Minikube</strong></a>, but when writing this post I struggled to get local forwarding of the <code>LoadBalancer</code> to work reliably enough to actually demonstrate traffic splitting. You might have more success than me and you’re welcome to try! But full disclosure, I spun up a GKE cluster in the end.</p>
<h2 id="heading-installing-argo-rollouts">Installing Argo Rollouts</h2>
<p>To set up Argo Rollouts we’ll create a namespace for the Argo controller, and install the controller along with the CRDs we need:</p>
<pre><code class="lang-bash">kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
</code></pre>
<p>We’ll also install the Rollouts plugin for <code>kubectl</code>, which will give us access to the <code>kubectl argo rollouts</code> sub-commands. You can obtain this from the <a target="_blank" href="https://github.com/argoproj/argo-rollouts/releases">releases</a> page, or if you’re using Homebrew just run:</p>
<pre><code class="lang-bash">brew install argoproj/tap/kubectl-argo-rollouts
</code></pre>
<h2 id="heading-creating-a-rollout">Creating a Rollout</h2>
<p>We’re going to create a <code>Rollout</code> object that uses the rather excellent Argo Rollouts web app. This app gives us a really nice visualisation of what’s happening as we release or rollback updates. We’ll also create a <code>LoadBalancer</code> object so we can access the app in a browser. Let’s start by creating the <code>rollout.yaml</code> file below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Rollout</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rollouts-demo</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">5</span>
  <span class="hljs-attr">strategy:</span>
    <span class="hljs-attr">canary:</span>
      <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">20</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {}
      <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">40</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-number">10</span>}
      <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">60</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-number">10</span>}
      <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">80</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-number">10</span>}
  <span class="hljs-attr">revisionHistoryLimit:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">rollouts-demo</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">rollouts-demo</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">rollouts-demo</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">argoproj/rollouts-demo:blue</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
          <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
          <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">32Mi</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">5m</span>
</code></pre>
<p>As you can see, most of this spec looks very much like a <code>Deployment</code> object. The big difference is the <code>strategy</code> section, which is specific to the <code>Rollout</code> CRD. In this section we specify the <code>canary</code> pattern, and then define the <code>steps</code> that we want for a successful rollout as a list. These are basically the automation instructions the controller will follow when we want to roll out an update.</p>
<ul>
<li><p>First we set the weight of the canary to <code>20</code>. In other words, we ask for 20% of the available <code>Pod</code> replicas to match the canary definition. Elsewhere in the spec we can see there are 5 <code>Pod</code> replicas, so 1 of them will match the canary. Next we have an empty <code>pause</code> definition, which means an indefinite pause; in other words, manual intervention will be required here to promote the rollout and continue with the next steps.</p>
</li>
<li><p>We then proceed with the rest of the steps in the canary. We set the weight to 40% (2 of 5 <code>Pods</code>), and wait for 10 seconds. Then we set the weight to 60% (3 of 5 <code>Pods</code>) and wait for 10 seconds. Then 80% and another 10 seconds, and finally the canary process will complete and all Pods in the Rollout will match the new definition.</p>
</li>
</ul>
<p>A cognitive hurdle I had to get over here was figuring out why we only have a single <code>Pod</code> spec. After all, if we’re defining a canary pattern, shouldn’t there be separate production and canary deployments? And of course, this is the beauty and simplicity of the Argo <code>Rollout</code>.</p>
<p>Every rollout <em>starts</em> as a canary, and eventually <em>becomes</em> production.</p>
<p>We’ll see this in a moment: the first time we create this object, we skip straight to having all of our <code>Pods</code> running the <code>rollouts-demo:blue</code> container, but when we perform the first change, we’ll see the canary logic in action.</p>
<p>Okay, next we need a Service object so we can access the workload. This is just a plain old regular LoadBalancer we’ll save as <code>service.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">rollouts-demo</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-string">http</span>
    <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">rollouts-demo</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<p>Once we’ve applied both of these objects to our cluster, we can watch the status of our <code>Rollout</code> object with this command:</p>
<pre><code class="lang-bash">kubectl argo rollouts get rollout rollouts-demo --watch
</code></pre>
<p>Because this is the initial creation of the object, we immediately scale up to 100% of the replicas running the <code>rollouts-demo:blue</code> container. Remember - the canary logic is only applied to <em>updates</em>, not to the initial creation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755010737061/45baffec-59d1-46a4-a6f4-b418e3dda4de.png" alt class="image--center mx-auto" /></p>
<p>I mentioned earlier that the demo web app supplied by Argo Rollouts is actually very good, and that’s because it provides a very nice visualisation of the requests being made by a web browser and the version of the <code>Pod</code> that’s serving them.</p>
<p>Grab the external IP of the rollouts-demo service with <code>kubectl get svc</code> and hopefully, you’ll see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755006213659/54d9bfb7-7b04-4a14-9d38-e312f0e087be.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-updating-a-rollout">Updating a Rollout</h2>
<p>Now it’s time to do our first update! Just like with a <code>Deployment</code> object, a <code>Rollout</code> is managing versions of our <code>Pods</code> using a <code>ReplicaSet</code> object. Right now we just have a single <code>ReplicaSet</code>, and if we make a change, a new <code>ReplicaSet</code> will be created. So let’s patch our <code>Rollout</code> object and change the container image:</p>
<pre><code class="lang-bash">kubectl argo rollouts <span class="hljs-built_in">set</span> image rollouts-demo \
  rollouts-demo=argoproj/rollouts-demo:yellow
</code></pre>
<p>This is where the <code>Rollout</code> logic comes in. The update will be progressively applied based on the logic we specified earlier. So first we’ll get a new <code>ReplicaSet</code> that will represent 20% of the total Pods. And we’ll pause there, requiring some manual intervention to proceed.</p>
<p>If you’re still running the previous watch command, you can see the updated state of the <code>Rollout</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755010820107/e9ba375a-43d8-4095-99fd-886fd4a4e4e6.png" alt class="image--center mx-auto" /></p>
<p>From this detail we can also see that our Rollout is at step 1 of 8, and is currently paused.</p>
<p>Jump back into your web browser, and you should eventually start to see the occasional request being served by a yellow <code>Pod</code> instead of a blue one. (Note: you may need to reload the page if it gets “stuck” making requests to the same <code>Pods</code> over and over again.)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755010965997/c8b9de9d-ee69-4495-b926-7e11a618293e.gif" alt class="image--center mx-auto" /></p>
<p>Like I said, a <code>pause</code> step with no duration defined will just remain paused indefinitely, so we must promote the rollout for it to continue to the next step:</p>
<pre><code class="lang-bash">kubectl argo rollouts promote rollouts-demo
</code></pre>
<p>Now we can observe the <code>Rollout</code> continue through the rest of its defined canary steps, slowly increasing the weight of the update until finally all <code>Pods</code> are running the new version. You can observe this in the output of <code>kubectl argo rollouts get</code>, but it’s much prettier to watch it on the demo web app:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755011519592/1e787c49-2de4-46c7-bd11-04f86574132b.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-aborting-a-rollout">Aborting a Rollout</h2>
<p>The canary pattern is of course about letting us try an update with a small subset of production traffic. So when we’re at the manual intervention stage, we can abort the rollout instead of promoting it, which will return the <code>Rollout</code> to its previous state.</p>
<p>Give this a try yourself, by first updating from the yellow container to the red one:</p>
<pre><code class="lang-bash">kubectl argo rollouts <span class="hljs-built_in">set</span> image rollouts-demo \
  rollouts-demo=argoproj/rollouts-demo:red
</code></pre>
<p>At this point you’ll have a canary running the red version (weighted at about 20%). Run the following command to abort, rather than promote, this rollout:</p>
<pre><code class="lang-bash">kubectl argo rollouts abort rollouts-demo
</code></pre>
<p>Now you can watch everything rollback to the previous version.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755012000995/455e9089-bced-46e0-918e-3a7870057150.gif" alt class="image--center mx-auto" /></p>
<p>This, however, puts our <code>Rollout</code> in a degraded state. This is what distinguishes an abort from a rollback, and we can see this detail in the watch view:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755012093142/472c7e44-fb67-4dc9-aa47-b4cc1ac0dcfd.png" alt class="image--center mx-auto" /></p>
<p>To fix this we need to “re-declare” the state we want to match the state we currently have. If our code specified the <code>rollouts-demo:yellow</code> container we could simply re-apply the object. In our case, it’s quicker to patch the object again:</p>
<pre><code class="lang-bash">kubectl argo rollouts <span class="hljs-built_in">set</span> image rollouts-demo \
  rollouts-demo=argoproj/rollouts-demo:yellow
</code></pre>
<p>No actual changes to <code>Pods</code> are required because we’re already running 100% yellow containers; we’re just reconciling the current state of the cluster with what <em>should</em> be running. This means the state of the <code>Rollout</code> will immediately change to healthy.</p>
<h1 id="heading-summary">Summary</h1>
<p>This has been a very short tour of Argo Rollouts, where really we’ve just demonstrated how the Rollout object serves as a more advanced drop-in replacement for a <code>Deployment</code>. But by doing this, hopefully I’ve helped demystify how this project works, and you can start to appreciate how useful it can be.</p>
<p>Stay tuned for the final stop on our journey through the Argo project - Argo Events!</p>
]]></content:encoded></item><item><title><![CDATA[What are Argo Workflows?]]></title><description><![CDATA[Continuing my efforts to explore and Learn All The Things™ on the CNCF graduated projects page, today I’m going to take a look at Argo Workflows, which is one of 4 tools in the Argo project overall (see my previous post on ArgoCD). Argo Workflows is ...]]></description><link>https://timberry.dev/what-are-argo-workflows</link><guid isPermaLink="true">https://timberry.dev/what-are-argo-workflows</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[argo]]></category><category><![CDATA[workflows]]></category><category><![CDATA[dags]]></category><category><![CDATA[CNCF]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Wed, 06 Aug 2025 10:13:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754323783257/234b078c-68bf-4429-ac0d-4bf42ab5e8ce.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Continuing my efforts to explore and Learn All The Things™ on the CNCF <a target="_blank" href="https://www.cncf.io/projects/">graduated projects</a> page, today I’m going to take a look at Argo Workflows, which is one of 4 tools in the Argo project overall (see my previous post on <a target="_blank" href="https://timberry.dev/what-is-argo-cd">ArgoCD</a>). <a target="_blank" href="https://argoproj.github.io/workflows/">Argo Workflows</a> is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. It’s an extremely powerful tool, but I think learning about it could benefit from a bit more scene-setting and context, which I humbly hope to impart in this post.</p>
<h2 id="heading-what-is-a-workflow">What is a Workflow?</h2>
<p>Before we jump into Argo specifically, I think it will help to understand what we mean by a <strong>Workflow</strong>.</p>
<p>Many folks may be used to setting up resources in Kubernetes that are designed to keep running - Deployments, for example. They may change and mutate, or scale up and down, but generally once we’ve declared that we want a service or workload to exist, we’re used to it being there until its eventual end of life. These are basically “long running services”, but not everything we need to accomplish fits that model.</p>
<p>Sometimes we want to run a workload that does <em>something</em>, or maybe even a series of <em>somethings</em>, and then stops. It executes successfully, and exits successfully, content in the knowledge of a job well done. This is the workflow pattern, and it’s used extensively for things like data processing and infrastructure automation.</p>
<h3 id="heading-kind-of-like-a-job">Kind of like a Job?</h3>
<p>A Kubernetes <a target="_blank" href="https://kubernetes.io/docs/concepts/workloads/controllers/job/">Job</a> is an implementation of this pattern, designed for various batch tasks, but with a very simplistic approach. Kubernetes Jobs typically execute a single task or batch job to completion, but they have very basic error handling and limited scope for dependency management.</p>
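<p>For contrast, here’s that single-task pattern as a plain <code>Job</code> (a minimal sketch; the name and command are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: hello-once
spec:
  backoffLimit: 2            # retries are a simple counter, not per-step logic
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: hello
        image: busybox
        command: ["echo", "hello from a plain Job"]
</code></pre>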
<p>By comparison, Argo can handle complex multi-step workflows, with built-in support for dependencies, retry logic and even handling artifacts between workflow steps. Now that we’ve set the scene for the problem we’re trying to solve, let’s get hands-on and try them out for ourselves!</p>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>To follow along, you’ll need access to a Kubernetes cluster. A simple local dev cluster will do, such as you might run with <a target="_blank" href="https://kind.sigs.k8s.io/"><strong>Kind</strong></a> or <a target="_blank" href="https://minikube.sigs.k8s.io/docs/"><strong>Minikube</strong></a>.</p>
<h2 id="heading-installing-argo-workflows"><strong>Installing Argo Workflows</strong></h2>
<p>To get started, we’ll run the following commands to create a dedicated namespace called <code>argo</code> and install the necessary CRDs (just like we did in our previous post!). In this command we’re specifying Argo version 3.7.0, but there may be a newer version available (check their <a target="_blank" href="https://github.com/argoproj/argo-workflows/releases">releases page here</a>).</p>
<pre><code class="lang-bash">kubectl create ns argo
kubectl apply -n argo -f <span class="hljs-string">"https://github.com/argoproj/argo-workflows/releases/download/v3.7.0/quick-start-minimal.yaml"</span>
</code></pre>
<p>Along with the CRDs we need to define Argo objects, we’ve also now installed a few components into the <code>argo</code> namespace. These are:</p>
<ul>
<li><p><strong>argo-server</strong>: This is the central component that provides an API for interacting with Argo workflows. It also provides a web UI if you like that sort of thing.</p>
</li>
<li><p><strong>workflow-controller</strong>: This component is responsible for managing the execution of workflows in Argo. It watches for new workflow submissions, orchestrates their execution, and manages their state.</p>
</li>
<li><p><strong>httpbin</strong>: This is a simple HTTP request and response service. It’s often used for testing workflows.</p>
</li>
<li><p><strong>minio</strong>: A high-performance, distributed object store that should probably get its own blog post in this series some day! Argo uses minio to handle artifact storage during workflow execution.</p>
</li>
</ul>
<p>Argo also has its own CLI for Workflows. Whereas the CLI for ArgoCD was called <code>argocd</code>, the CLI for Argo Workflows is just called <code>argo</code>. This doesn’t bother me at all 😬 so let’s go ahead and install it.</p>
<p>You can grab a binary from the <a target="_blank" href="https://github.com/argoproj/argo-workflows/releases/">releases page</a>, or if you’re using Homebrew run:</p>
<pre><code class="lang-bash">brew install argo
</code></pre>
<p>The argo CLI only has a handful of simple commands, and this is all we need to interact with the Argo server:</p>
<ul>
<li><p>To submit a workflow, we use <code>argo submit</code></p>
</li>
<li><p>To list current workflows, we use <code>argo list</code></p>
</li>
<li><p>To see info about a specific workflow, we use <code>argo get</code></p>
</li>
<li><p>To print logs from a workflow, we use <code>argo logs</code></p>
</li>
<li><p>Finally, to delete a workflow, we use <code>argo delete</code></p>
</li>
</ul>
<h1 id="heading-an-example-workflow">An example workflow</h1>
<p>The Argo docs provide an example “Hello World” workflow to get you started. Before we submit it though, let’s take a look at the YAML so we understand what it’s actually doing:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workflow</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">generateName:</span> <span class="hljs-string">hello-world-</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">workflows.argoproj.io/archive-strategy:</span> <span class="hljs-string">"false"</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">workflows.argoproj.io/description:</span> <span class="hljs-string">|
      This is a simple hello world example.
</span><span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">hello-world</span>
  <span class="hljs-attr">templates:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">hello-world</span>
    <span class="hljs-attr">container:</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">echo</span>]
      <span class="hljs-attr">args:</span> [<span class="hljs-string">"hello world"</span>]
</code></pre>
<p>As you can see, this is a <code>Workflow</code> type of object, as defined by one of the CRDs we just installed.</p>
<p>In the <code>metadata</code> section, we’re using <code>generateName</code> instead of <code>name</code>. This is actually a native Kubernetes feature that you don’t see too often in the wild. Using <code>generateName</code> here means that this object, when created, will use <code>hello-world-</code> as a prefix, and the server will add a unique generated suffix to its name.</p>
<p>Like any other Kubernetes object, the <code>spec</code> then defines the core structure of our workflow. In Argo, we provide this as a list of <code>templates</code> and an <code>entrypoint</code>. Templates are re-usable definitions of some sort of workflow logic which we’ll explore in more detail below, and the <code>entrypoint</code> is just which template we run first when the workflow starts.</p>
<h1 id="heading-templates">Templates</h1>
<p>So, as we’ve just stated, templates are re-usable sets of instructions. There are many different <em>types</em> of templates, but they all belong to one of these categories:</p>
<ul>
<li><p><strong>Template definitions</strong> are the types of templates that define work to be done, usually in some sort of container.</p>
</li>
<li><p><strong>Template invocators</strong> are ways to call other templates, and define execution control (such as ordering and dependencies).</p>
</li>
</ul>
<p>Coming back to our “Hello World” example, we are defining a single template called <code>hello-world</code>, and this template is a <code>container</code> type. This is the most commonly used template type in the <em>definition</em> category, and it simply lets you define a container to run in exactly the same way as you would anywhere else in Kubernetes (such as in a Pod spec).</p>
<p>The expectation is that the container will execute some command, optionally with some arguments, and then exit successfully. The output of the container is stored in an Argo variable, so that you could use it later if you wanted to (we’re not doing that here of course, because we only have a single template in this workflow).</p>
<h2 id="heading-submitting-a-workflow">Submitting a Workflow</h2>
<p>Let’s submit this example workflow to our Argo service. We’ll add the optional <code>--watch</code> flag to this command, so that we can watch the workflow complete:</p>
<pre><code class="lang-bash">argo submit -n argo --watch https://raw.githubusercontent.com/argoproj/argo-workflows/main/examples/hello-world.yaml
</code></pre>
<p>Watching the workflow shows us neatly what’s happening: the workflow executes successfully, and the watch exits when it’s finished.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754406348663/9f20bc5d-e6c2-4e7c-9b89-da9b25959b0b.png" alt class="image--center mx-auto" /></p>
<p>We can see that it took about 20 seconds for the workflow to finish successfully, and that a Pod called <code>hello-world-qnk9t</code> existed for about 9 seconds. Pretty cool!</p>
<h2 id="heading-other-template-types">Other Template Types</h2>
<p>While <code>containers</code> are most commonly used for templates, there are other types too within the definition category:</p>
<ul>
<li><p><code>script</code> is basically a convenience wrapper around <code>container</code>, which adds a <code>source</code> field to allow you to define a script in-place, for example to run an in-line Python script on a <code>python:alpine</code> container (see the sketch after this list)</p>
</li>
<li><p><code>resource</code> allows you to perform resource operations on a Kubernetes cluster, for example to create a <code>ConfigMap</code>. This is super useful when you realise you can pass variables and parameters between templates in a workflow.</p>
</li>
<li><p><code>suspend</code> allows you to suspend the execution of a workflow until it is resumed manually.</p>
</li>
<li><p><code>plugin</code> allows you to reference external plugins.</p>
</li>
<li><p><code>containerset</code> lets you use multiple containers within a single Pod.</p>
</li>
<li><p><code>http</code> lets you execute HTTP requests and store the results as a variable to use elsewhere.</p>
</li>
</ul>
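<p>As an example of the first of those, here’s roughly what a <code>script</code> template looks like (a sketch adapted from the upstream examples; the template name is illustrative):</p>
<pre><code class="lang-yaml">- name: gen-random-int
  script:
    image: python:alpine
    command: [python]
    source: |
      import random
      print(random.randint(1, 100))  # stdout becomes the template's result
</code></pre>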
<h1 id="heading-invocators-and-execution-control">Invocators and execution control</h1>
<p>Now we’re getting to the good stuff, and hopefully things will start to make more sense!</p>
<p>In the “Hello World” example, we have a single template acting as a single step in a workflow. This is a great starting point, but it can make the terminology confusing. Why is a step called a template?</p>
<p>The answer is that a template is designed to be re-used. We should use templates to create functions that can be referenced multiple times (applying the principle of <strong>Don’t Repeat Yourself</strong>).</p>
<p>We can then arrange <em>calls</em> to these functions using one of the <strong>invocator</strong> categories of templates.</p>
<p>So, within this category, you have 2 options:</p>
<p><code>steps</code> is the most straightforward invocator type, and lets you define your tasks in a series of steps. The steps will run one by one, but you can nest them - outer lists will run sequentially, and inner lists will run in parallel. There are also advanced synchronisation and conditional options for steps, but they’re a bit beyond our scope for today.</p>
<p>Here’s an example of using steps in a workflow:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workflow</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">generateName:</span> <span class="hljs-string">steps-example-</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">data-workflow</span>
  <span class="hljs-attr">templates:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">data-workflow</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">step1</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">prepare-data</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">step2a</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">process-data</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">step2b</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">notify-data-team</span>
</code></pre>
<p>In this snippet, we create a workflow object with an <code>entrypoint</code> of <code>data-workflow</code>. The <code>data-workflow</code> template is a <code>steps</code> type, so it defines a set of steps to run.</p>
<ul>
<li><p><code>step1</code> runs first, and it calls a template called <code>prepare-data</code> (we would declare later on in this YAML file what <code>prepare-data</code> actually is and what it does, but we’re omitting it for now to keep this simple).</p>
</li>
<li><p>Once that step is completed, <code>step2a</code> and <code>step2b</code> run in parallel as they’re in a nested list. So the <code>process-data</code> and <code>notify-data-team</code> templates would both be called at the same time.</p>
</li>
</ul>
<p>If you have a more complex set of dependencies for steps in a workflow, you can specify these using a directed acyclic graph, or DAG, with the <code>dag</code> invocator template type. A DAG is basically a data structure that consists of nodes connected by directed edges (each edge points one way), with no cycles (you cannot return to a node once you leave it). They’re used a lot in workflow orchestration, and pop up in other tools like Apache Airflow.</p>
<p>Let’s look at this example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workflow</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">generateName:</span> <span class="hljs-string">dag-example-</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">data-workflow</span>
  <span class="hljs-attr">templates:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">data-workflow</span>
    <span class="hljs-attr">dag:</span>
      <span class="hljs-attr">tasks:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">A</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">echo</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">B</span>
        <span class="hljs-attr">dependencies:</span> [<span class="hljs-string">A</span>]
        <span class="hljs-attr">template:</span> <span class="hljs-string">echo</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">C</span>
        <span class="hljs-attr">dependencies:</span> [<span class="hljs-string">A</span>]
        <span class="hljs-attr">template:</span> <span class="hljs-string">echo</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">D</span>
        <span class="hljs-attr">dependencies:</span> [<span class="hljs-string">B</span>, <span class="hljs-string">C</span>]
        <span class="hljs-attr">template:</span> <span class="hljs-string">echo</span>
</code></pre>
<p>For the sake of simplicity, every step here is referencing the same dummy template called <code>echo</code> - the important thing is the order of how these steps are run, and how they depend upon each other.</p>
<ul>
<li><p><code>A</code> runs first</p>
</li>
<li><p><code>B</code> and <code>C</code> will run in parallel, as this is the default, but only if <code>A</code> has completed successfully because of the dependency we’ve defined.</p>
</li>
<li><p><code>D</code> can then run, but again only if its dependency on the successful completion of <code>B</code> and <code>C</code> is met.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754408991955/658adedf-f7a2-47d6-aa79-7e2a8d8c5a34.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-templates-and-parameters">Templates and parameters</h1>
<p>Passing data or artifacts between steps is super useful too. Let’s take a look at another <code>steps</code> example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workflow</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">generateName:</span> <span class="hljs-string">messages-</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">my-workflow</span>
  <span class="hljs-attr">templates:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">my-workflow</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">hi-bob</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">print-message</span>
        <span class="hljs-attr">arguments:</span>
          <span class="hljs-attr">parameters:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">message</span>
            <span class="hljs-attr">value:</span> <span class="hljs-string">"Bob"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">hi-emily</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">print-message</span>
        <span class="hljs-attr">arguments:</span>
          <span class="hljs-attr">parameters:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">message</span>
            <span class="hljs-attr">value:</span> <span class="hljs-string">"Emily"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">hi-again</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">print-message</span>
        <span class="hljs-attr">arguments:</span>
          <span class="hljs-attr">parameters:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">message</span>
            <span class="hljs-attr">value:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{steps.hi-emily.outputs.result}}</span>"</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">print-message</span>
    <span class="hljs-attr">inputs:</span>
      <span class="hljs-attr">parameters:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">message</span>
    <span class="hljs-attr">container:</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">echo</span>]
      <span class="hljs-attr">args:</span> [<span class="hljs-string">"Hi <span class="hljs-template-variable">{{inputs.parameters.message}}</span>"</span>]
</code></pre>
<p>In this YAML our workflow object will get a name starting with <code>messages-</code> and we invoke the workflow with the <code>my-workflow</code> template (specified as our <code>entrypoint</code>). Also, <code>my-workflow</code> is of the <code>steps</code> type. Each step will run in order, as we don’t have any nested lists here. This should all be starting to make sense now 😄</p>
<p>Each step references the same template, called <code>print-message</code>. So we’re re-using that <em>function</em> multiple times. You can see it defined at the bottom of the YAML file, where it just uses the <code>busybox</code> container to echo a string to standard output. A nice convention here is to leave a blank line in the YAML before the <code>print-message</code> template, reminding us that it’s a function that gets <em>called</em> by steps in the workflow, but it’s not a part of the workflow on its own.</p>
<p>Let’s walk through <code>my-workflow</code>:</p>
<ul>
<li><p>In each step we’re demonstrating how we can parameterise our templates. In the <code>hi-bob</code> and <code>hi-emily</code> steps, we’re using arguments to pass a parameter to the template. The parameter is called <code>message</code>, and as you can see we can use any value we want in each step.</p>
</li>
<li><p>In the final step, rather than directly specifying the value of <code>message</code>, we actually reference the output of the previous step!</p>
</li>
</ul>
<p>So what does this look like when we run it? Let’s give it a try. We’ll write this to a file called <code>messages.yaml</code> and submit it to Argo:</p>
<pre><code class="lang-bash">argo submit -n argo --watch messages.yaml
</code></pre>
<p>Here’s what the watch looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754410266891/7dd036ba-5ed8-4595-a05d-058cd1a05e0f.png" alt class="image--center mx-auto" /></p>
<p>We can also see the logs of a workflow like this:</p>
<pre><code class="lang-bash">argo logs -n argo @latest
</code></pre>
<p>This output is helpfully colour-coded! We can see the standard output of each step, saying “Hi Bob”, “Hi Emily” and of course, “Hi Hi Emily” (because we passed “Hi Emily” as the name to say Hi to!) 😄</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754410368063/3cab568c-5b32-42c1-ae76-216faf75cccd.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-please-tell-me-theres-a-ui">Please tell me there’s a UI</h2>
<p>Of course there is! We can port-forward the UI and then connect to <a target="_blank" href="https://localhost:2746">https://localhost:2746</a>. You will get some scary self-signed TLS errors however, and if you go on to use Argo in production you should definitely configure <a target="_blank" href="https://argo-workflows.readthedocs.io/en/latest/tls/">TLS properly</a>.</p>
<pre><code class="lang-bash">kubectl -n argo port-forward service/argo-server 2746:2746
</code></pre>
<h1 id="heading-summary">Summary</h1>
<p>We’ve now had a quick tour of Argo Workflows, and hopefully you’ve got a good idea of its basic functionality as well as its potential use cases. To get to grips with the advanced features of Argo, I’d definitely recommend browsing the <a target="_blank" href="https://argo-workflows.readthedocs.io/en/latest/workflow-concepts/">user guide</a> and playing with some more examples. Stay tuned, as next time we’ll be tackling Argo Rollouts!</p>
]]></content:encoded></item><item><title><![CDATA[What is Argo CD?]]></title><description><![CDATA[In my never-ending quest to Learn All The Things™, I’ve decided to take a tour through the Cloud Native Computing Foundation’s graduated projects page. I’m doing this for a few different reasons:

Learning new stuff is hip and cool 😎

The CNCF lands...]]></description><link>https://timberry.dev/what-is-argo-cd</link><guid isPermaLink="true">https://timberry.dev/what-is-argo-cd</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[argo]]></category><category><![CDATA[CNCF]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Mon, 28 Jul 2025 14:51:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754323756653/5ddc2a13-51d6-4ed4-a3f8-14fc55eff307.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my never-ending quest to Learn All The Things™, I’ve decided to take a tour through the Cloud Native Computing Foundation’s <a target="_blank" href="https://www.cncf.io/projects/">graduated projects</a> page. I’m doing this for a few different reasons:</p>
<ol>
<li><p>Learning new stuff is hip and cool 😎</p>
</li>
<li><p>The CNCF landscape looks scary! But learning the graduated projects one by one seems manageable and useful.</p>
</li>
<li><p>The whole CNCF project is truly inspiring. It’s a real community effort, contributed to by thousands of smart people who build actually useful things together!</p>
</li>
</ol>
<p>Graduated projects are considered stable and are used successfully in production environments (so the CNCF website proudly declares), so I’m going to explore each one in turn in this series. The first one on this list is:</p>
<h1 id="heading-argo">Argo</h1>
<p>And Argo is actually more than one thing! Argo defines itself as “Kubernetes-native tools to run workflows, manage clusters, and do GitOps right”. Argo was accepted to CNCF on March 26, 2020 at the Incubating maturity level and then moved to the Graduated maturity level on December 6, 2022. Here we are some years later, and the Argo project has thousands of users and some impressive production case studies.</p>
<p>Argo now comprises 4 different tools:</p>
<ul>
<li><p><strong>Argo Workflows</strong>: Kubernetes-native workflow engine supporting DAG and step-based workflows.</p>
</li>
<li><p><strong>Argo CD</strong>: Declarative continuous delivery with a fully-loaded UI.</p>
</li>
<li><p><strong>Argo Rollouts</strong>: Advanced Kubernetes deployment strategies such as Canary and Blue-Green made easy.</p>
</li>
<li><p><strong>Argo Events</strong>: Event based dependency management for Kubernetes.</p>
</li>
</ul>
<p>Taking things slightly out of order, I’m going to start by looking at Argo CD, as this is the project’s most popular tool, and it’ll give us a good understanding of the project’s conventions we can then take forward to learn about the other tools. Let’s go!</p>
<h1 id="heading-argo-cd">Argo CD</h1>
<p>So what is Argo CD? In a nutshell, Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes. The idea is that while native Kubernetes brings the declarative model to its foundational building blocks, Argo CD extends that approach so that everything you need to deploy applications, including configurations and environments, is also declarative and version controlled. Treating everything in the deployment lifecycle as software makes it more reliable and easier to collaborate on.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753695419918/26644e3a-151c-4be0-afb5-cfe4e8ad6e74.gif" alt class="image--center mx-auto" /></p>
<p>Argo introduces its own opinionated concepts about how to get stuff done in Kubernetes. There are quite a few of these, but the core ones you need to understand are:</p>
<ul>
<li><p>An <strong>Application</strong> is a group of Kubernetes resources defined by a manifest. Following best practices, we tend to break down things we run in Kubernetes into their components, such as an individual microservice, data store or messaging queue. But even these things require multiple native Kubernetes objects. An Argo Application then groups all the necessary resource objects together to form a single deployable component.</p>
</li>
<li><p>Applications have an <strong>Application Source Type</strong>, which is determined by the <strong>Tool</strong> used to build them. Essentially the <strong>Tool</strong> is just something you use to process Kubernetes manifests and make any necessary changes. Commonly used tools are <a target="_blank" href="https://kustomize.io/">Kustomize</a> and <a target="_blank" href="https://helm.sh/">Helm</a>.</p>
</li>
</ul>
<p>As a declarative system, ArgoCD cares about the state of your applications. So when dealing with Argo lifecycles we need to understand these state definitions:</p>
<ul>
<li><p>The <strong>Target State</strong> is the desired state of an <strong>Application</strong>, declared in code in a Git repo</p>
</li>
<li><p>The <strong>Live State</strong> is the current state of the <strong>Application</strong> (e.g. what Pods are currently deployed)</p>
</li>
<li><p>A <strong>Sync</strong> is the process of moving an <strong>Application</strong> to its <strong>Target State</strong> (i.e. by making changes to a Kubernetes cluster)</p>
</li>
<li><p>A <strong>Sync Status</strong> represents whether or not the <strong>Live State</strong> matches the <strong>Target State</strong>.</p>
</li>
</ul>
<p>With those concepts in mind, let’s walk through Argo CD’s Getting Started Guide. You can of course follow the guide on Argo’s website, but it’s fairly bare-bones! In this post I want to try to add some more detail and explanation to what we’re doing, so we really understand the purpose of using ArgoCD.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you’ll need access to a Kubernetes cluster. A simple local dev cluster will do, such as you might run with <a target="_blank" href="https://kind.sigs.k8s.io/">Kind</a> or <a target="_blank" href="https://minikube.sigs.k8s.io/docs/">Minikube</a>.</p>
<h2 id="heading-installing-argo-cd">Installing Argo CD</h2>
<p>To get started, we’ll run the following commands to create a dedicated namespace called <code>argocd</code> and install the necessary CRDs:</p>
<pre><code class="lang-bash">kubectl create ns argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</code></pre>
<p>It might not feel like a good practice to put everything in a single namespace right now, but you can consider <code>argocd</code> to be your <strong>Control Plane</strong> namespace. While your Application CRDs may get deployed here, the actual Kubernetes resources that they create and manage can be directed to other <strong>Data Plane</strong> namespaces, which can be designed using the usual methods for application isolation.</p>
<p>In one hopefully trustworthy command, we’ve now installed around 60 new resources to our Kubernetes cluster. This includes CRDs to support Argo Applications, along with Roles, ClusterRoles, RoleBindings, ClusterRoleBindings, and the necessary Deployments, Services and ConfigMaps to support the key moving parts of ArgoCD:</p>
<ul>
<li><p><strong>argocd-server</strong>: This is the API server. It exposes the gRPC/REST API that the Argo CD CLI and UI consume. It's responsible for managing applications, projects, and authentication.</p>
</li>
<li><p><strong>argocd-repo-server</strong>: This server is responsible for cloning your Git repositories, caching them locally, and generating the Kubernetes manifests from the source (e.g., by running <code>kustomize</code> or hydrating a Helm chart).</p>
</li>
<li><p><strong>argocd-application-controller</strong>: This is the core controller that continuously monitors the live state of your applications against the desired state defined in Git. When a difference is detected (<code>OutOfSync</code>), this controller is what executes the sync operations.</p>
</li>
<li><p><strong>argocd-dex-server</strong>: Dex is an identity service that uses OpenID Connect (OIDC). Argo CD includes it by default to handle authentication by federating identity from other providers like SAML, LDAP, or social logins (e.g., GitHub, Google).</p>
</li>
<li><p><strong>argocd-redis</strong>: This deployment runs a Redis cache. Argo CD uses it extensively for storing the application state cache, OIDC tokens, and other temporary data to improve performance and reduce requests to the Kubernetes API server.</p>
</li>
<li><p><strong>argocd-notifications-controller</strong>: An optional controller that monitors the Application resources and, based on triggers you configure, sends notifications about application health and sync status to services like Slack, Email, and (if you’re unlucky) Microsoft Teams.</p>
</li>
</ul>
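<p>Once the install settles, you can see all of these moving parts for yourself:</p>
<pre><code class="lang-bash">kubectl get pods -n argocd
</code></pre>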
<p>Next we’ll install the <code>argocd</code> CLI. You can download the latest release from <a target="_blank" href="https://github.com/argoproj/argo-cd/releases/latest">https://github.com/argoproj/argo-cd/releases/latest</a> or if you’re using Homebrew, go ahead and run:</p>
<pre><code class="lang-bash">brew install argocd
</code></pre>
<p>We can now use the CLI to log into our ArgoCD service:</p>
<pre><code class="lang-bash">argocd login --core
</code></pre>
<p>Finally, let’s set the default namespace for our current Kubernetes context. This just saves us some time because for the purposes of this demo, we’re running Argo and our applications on the same cluster. (Note: Using other clusters is super easy and can be managed with the <a target="_blank" href="https://argo-cd.readthedocs.io/en/stable/user-guide/commands/argocd_cluster/">argocd cluster</a> command)</p>
<pre><code class="lang-bash">kubectl config set-context --current --namespace=argocd
</code></pre>
<h2 id="heading-creating-a-sample-application">Creating a sample Application</h2>
<p>The Argo project provides a number of sample applications for us to play with when we’re learning how ArgoCD works. We’ll start with the basic guestbook example. You can view the components at <a target="_blank" href="https://github.com/argoproj/argocd-example-apps/tree/master/guestbook">https://github.com/argoproj/argocd-example-apps/tree/master/guestbook</a></p>
<p>As you can see, it’s just a simple <code>Deployment</code> and <code>Service</code>.</p>
<p>To create an Argo application using this repo, we’ll run this command:</p>
<pre><code class="lang-bash">argocd app create guestbook \
  --repo https://github.com/argoproj/argocd-example-apps.git \
  --path guestbook \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default
</code></pre>
<p>Let’s break that down:</p>
<ul>
<li><p><a target="_blank" href="https://argo-cd.readthedocs.io/en/stable/user-guide/commands/argocd_app_create/">argocd app create</a> is the command to create a new ArgoCD <strong>Application</strong>. We specify <code>guestbook</code> as the application name, and then provide some flags:</p>
</li>
<li><p><code>--repo</code> sets the repository source where the resource definitions can be found. We could have alternatively used local files.</p>
</li>
<li><p><code>--path</code> identifies the location within that repo for our resources, and in this case that’s the <code>guestbook</code> directory within the repo.</p>
</li>
<li><p><code>--dest-server</code> and <code>--dest-namespace</code> identify the destination Kubernetes cluster and namespace respectively (you probably figured that one out yourself 😁)</p>
</li>
</ul>
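<p>Under the hood, the CLI is just creating a Kubernetes object for us. The equivalent declarative manifest looks roughly like this (a sketch; <code>targetRevision</code> defaults to <code>HEAD</code>):</p>
<pre><code class="lang-yaml">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    path: guestbook
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: default
</code></pre>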
<p>When this command completes, we have created the declarative definition of our new ArgoCD Application. Let’s check on its status with this command:</p>
<pre><code class="lang-bash">argocd app get guestbook
</code></pre>
<p>Hmm… the <strong>Health Status</strong> is <code>Missing</code> and the <strong>Sync Status</strong> is <code>OutOfSync</code>!</p>
<p>That’s to be expected, because we need to perform the first <strong>Sync</strong>. Recall from earlier that the <strong>Sync</strong> process should align our <strong>Live State</strong> with our <strong>Target State.</strong> Right now, our <strong>Live State</strong> is that nothing has been deployed, so the <strong>Sync</strong> should fix that!</p>
<pre><code class="lang-bash">argocd app sync guestbook
</code></pre>
<p>If you check the status again after running this command, you should see that the <strong>Sync</strong> operation has succeeded, and the <strong>Health Status</strong> is <code>Progressing</code>. In other words, ArgoCD knows what it needs to do to perform the <strong>Sync</strong>, and that work is currently underway.</p>
<p>After a few moments, the <strong>Health Status</strong> should change to <code>Healthy</code> because all the actions will have been completed, and the new guestbook <strong>Application</strong> is running as expected.</p>
<h2 id="heading-making-and-syncing-changes">Making and syncing changes</h2>
<p>The job of declarative systems like ArgoCD is to make sure that the live state matches the state you are declaring in code. So let’s say we make a change to our git repo, what happens next? (Note: You can try this for yourself by forking the Argo examples repo and then making some changes).</p>
<p>The <code>argocd-application-controller</code> performs a reconciliation loop to ensure that the live state still matches the state declared in code. It relies on the <code>argocd-repo-server</code> to monitor the git repo associated with an <strong>Application</strong> for any changes (by default it checks around every 3 minutes). A <code>diff</code> is performed, and any change (such as a new container image or a different replica count) will flip the <strong>Application</strong>’s <strong>Sync Status</strong> to <code>OutOfSync</code>.</p>
<p>By default, it’s then up to the operator to run another sync command, which allows ArgoCD to reconcile the live state with the changes.</p>
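<p>If you want to see exactly what has drifted before you sync, the CLI can show you the difference between the live state and the target state, for example for our guestbook app:</p>
<pre><code class="lang-bash"># Compare the live state on the cluster with the target state in git
argocd app diff guestbook

# Then reconcile the two once you're happy with the change
argocd app sync guestbook
</code></pre>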
<p>Optionally, you can configure an automated <code>syncPolicy</code> in your <strong>Application</strong> definition. In this policy you can also enable <code>selfHeal</code>, which will allow ArgoCD to automatically revert back to the state defined in git if anyone makes any manual changes directly on the cluster. Neat!</p>
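<p>For reference, here’s a minimal sketch of what our guestbook <strong>Application</strong> could look like as a declarative manifest with an automated <code>syncPolicy</code> enabled, using the same repo and destination values we passed on the command line earlier:</p>
<pre><code class="lang-yaml"># Declarative equivalent of the guestbook app, with automated sync enabled
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    path: guestbook
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true     # remove resources that disappear from git
      selfHeal: true  # revert manual changes made directly on the cluster
</code></pre>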
<h2 id="heading-using-helm">Using Helm</h2>
<p>One of the primary benefits of an automated CD solution like this is to automate changes across different application environments (such as dev to production). Let’s walk through a super simple example that leverages Helm and manages different versions of an application. We’ll use the sample code from the ArgoCD repo here: <a target="_blank" href="https://github.com/argoproj/argocd-example-apps/tree/master/helm-guestbook">https://github.com/argoproj/argocd-example-apps/tree/master/helm-guestbook</a></p>
<p>Note that Helm is only used here to inflate a chart using <code>helm template</code> under the hood. When coupled with ArgoCD, we’re letting Argo handle the lifecycle of the component itself, not Helm. We’re basically taking the nice and easy templating stuff from Helm (because Kustomize is weird!) but making our deployments continuous with ArgoCD.</p>
<p>So first we’ll create 2 different namespaces, one for development and one for production:</p>
<pre><code class="lang-bash">kubectl create ns guestbook-dev
kubectl create ns guestbook-prod
</code></pre>
<p>Now we can create an ArgoCD <strong>Application</strong> for development:</p>
<pre><code class="lang-bash">argocd app create guestbook-dev \
  --repo https://github.com/argoproj/argocd-example-apps.git \
  --path helm-guestbook \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace guestbook-dev \
  --sync-policy automated --auto-prune --self-heal \
  --values values.yaml
</code></pre>
<p>Notice here we are enabling automated syncs, pruning and self-healing. Our development environment will now stay dynamically up to date with the latest version of our resources committed to our git repo.</p>
<p>Let’s create the production version of this <strong>Application</strong>, but this time we’ll default to manual synchronisation and we’ll specify the <code>values-production.yaml</code> file, providing some environment-specific overrides:</p>
<pre><code class="lang-bash">argocd app create guestbook-prod \
  --repo https://github.com/argoproj/argocd-example-apps.git \
  --path helm-guestbook \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace guestbook-prod \
  --values values-production.yaml
</code></pre>
<p>Now if we run <code>argocd app get</code> on each app, we’ll see that both have been created, and the development app has automatically synced. We can go ahead and sync the production app manually with <code>argocd app sync</code> like we did earlier.</p>
<h2 id="heading-is-there-a-fancy-ui">Is there a fancy UI?</h2>
<p>Yes, there is! The first time you access it, you’ll need to retrieve the default admin password. You can do this with this command:</p>
<pre><code class="lang-bash">argocd admin initial-password -n argocd
</code></pre>
<p>Now we’ll use port-forwarding to connect to the web UI (other options, like <a target="_blank" href="https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/">ingress</a>, are available):</p>
<pre><code class="lang-bash">kubectl port-forward svc/argocd-server -n argocd 8080:443
</code></pre>
<p>Once you’ve logged in you can explore all of your <strong>Application</strong>s, states, syncs and revisions through the UI. You should also change the admin password, and delete the <code>argocd-initial-admin-secret</code> from the Argo CD namespace once you have done so.</p>
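<p>Both of those clean-up steps can be done from the command line too. Note that the password change goes through the Argo CD API server, so if you logged in with <code>--core</code> earlier you may need to log in against the server (for example via the port-forward above) first:</p>
<pre><code class="lang-bash"># Change the admin password (prompts for the current and new passwords)
argocd account update-password

# Once changed, remove the initial admin secret
kubectl delete secret argocd-initial-admin-secret -n argocd
</code></pre>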
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753713613026/f00fd993-b37d-42d4-aae2-5a6223265fa4.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753713623912/83304b0e-35d2-46ec-a152-95866196eb4d.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-summary">Summary</h1>
<p>That’s the end of our quick tour of ArgoCD, but hopefully I gave you a bit more to go on than the <a target="_blank" href="https://argo-cd.readthedocs.io/en/stable/">official docs</a>, which are very good if a little overwhelming! Hopefully I can keep the momentum of this series going, and in the next post I’ll be tackling Argo Workflows. See you then!</p>
]]></content:encoded></item><item><title><![CDATA[DevOps, GitOps and CI/CD with GKE]]></title><description><![CDATA[This is the tenth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to supp...]]></description><link>https://timberry.dev/devops-gitops-and-cicd-with-gke</link><guid isPermaLink="true">https://timberry.dev/devops-gitops-and-cicd-with-gke</guid><category><![CDATA[Devops]]></category><category><![CDATA[gitops]]></category><category><![CDATA[cicd complete proccess]]></category><category><![CDATA[gke]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[google cloud]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Mon, 24 Feb 2025 12:31:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740398888073/a3d2bf66-ca0f-41e0-bb7a-c6a2fd5a25c5.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the tenth post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>Learning how complex systems work is often an exploratory procedure. We need to slowly and methodically explore new features, learn their dependencies, and understand how they operate holistically with other components. This ultimately provides us with the knowledge to fully understand a system and to design architectures that use it. And that’s basically what I’ve been trying to do in this blog series!</p>
<p>But working manually is not a recommended way to manage production systems once they have been deployed. When organisations rely on critical infrastructure, we need build and deployment methods that are reliable, repeatable and auditable, and none of these things apply to iterative manual work. And that’s why the concept of <strong>Infrastructure as Code</strong> has long been established in IT to achieve these goals.</p>
<p>We can define infrastructure in code and rely on that code to be deterministically repeatable. We can even audit that code, as it provides a living documentation of what’s being built. Extending those principles to “Everything as Code”, the way we deploy applications can also be automated using many of the same approaches and tools. Giving us faster and safer deployment methods means we can iterate on software rapidly but reliably.</p>
<p>Software lifecycles are now managed almost exclusively through CI/CD pipelines. This approach combines <strong>Continuous Integration</strong> (CI), where developers synchronize code changes in a central repository as frequently as possible, and <strong>Continuous Delivery</strong> (CD), where new updates are released into a production environment. We’ve come a long way from the 6-month release cycles of a decade ago, to many cloud-native organisations deploying small code changes hundreds of times a day.</p>
<p>So in this post, we’re going to learn about the GKE tooling that we can use to achieve this sort of developer agility. We’ll be looking at:</p>
<ul>
<li><p>An overview of DevOps and GitOps practices</p>
</li>
<li><p>Building a basic CI/CD architecture with Cloud Build and Cloud Deploy</p>
</li>
<li><p>Some considerations for private build infrastructure</p>
</li>
</ul>
<p>By the end of this post, you should feel comfortable automating the complete lifecycle of your workloads on GKE!</p>
<p><strong><em>Just one quick note before we get started:</em></strong> <em>This post covers native CI/CD tools for GKE to help you build a software delivery framework – but we’re talking about your workloads here, not your clusters. It is also recommended that your GKE infrastructure is automated through Infrastructure as Code, as I touched on at the start of this series. We can’t get side-tracked into a large Terraform tutorial here though, so if you need some further reading on this topic, take a look at:</em> <a target="_blank" href="https://cloud.google.com/docs/terraform"><em>https://cloud.google.com/docs/terraform</em></a></p>
<h2 id="heading-an-overview-of-devops-and-gitops-practices">An overview of DevOps and GitOps practices</h2>
<p>There’s a lot of debate about the true definition of the term “DevOps” (which comes from the contraction of Developer and Operations). In a nutshell, DevOps is a methodology. It’s a change to the way we work – the processes we follow and the tools we use, to introduce elements of the software engineering world into how we build and maintain infrastructure. GitOps is simply an extension of DevOps that implements automation based around a git repository as a single source of truth.</p>
<p>But why do we need DevOps? Why are automation and speed so important? While there’s definitely a business case for improving developer agility and getting new features to market faster than your competitors, speed is really just a benefit of doing DevOps properly – it's not a requirement for doing it in the first place. If you can trust your platform to allow you to iterate quickly, this means that it must be a reliable platform, and a reliable platform is much easier to fix when things go wrong.</p>
<p>So how does automation give us a reliable platform? Once again, it’s about removing the human variable from the equation. Historically, production computer systems could be considered to be in a fairly fragile state. Operators would dread having to update applications, not least because the updated code had probably been thrown over a wall by a development team with little knowledge of the actual server where it was intended to end up. Installing an update meant changing the state of a system that had been established by a hundred different manual interactions over the previous years. Even worse, in the event of a failure or disaster recovery scenario, how do you get that server back to that unknowable state?</p>
<p>As mentioned way back at the start of this series, the move to declarative configuration was a paradigm shift in the way we manage systems. If we trust that our declarative tools can maintain a deterministic state of a system, suddenly we’re not afraid to change that state. Rolling back a failed update or recovering from a failure simply means referring to an earlier documented state. And now we have modern tooling that achieves this model across infrastructure and application deployments.</p>
<h3 id="heading-a-software-delivery-framework">A software delivery framework</h3>
<p>While real-life implementations will vary considerably based on team sizes, workloads and even organisation policies, we can describe an ideal architecture for a software delivery framework as illustrated below. Later on we’ll walk through actually building this reference architecture.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740046939720/99815a82-277f-4d89-a644-53a4aa10d0e1.png" alt class="image--center mx-auto" /></p>
<p>The foundation for the architecture is using <strong>git</strong> as a source of truth and basing all actions within the architecture (deploying or updating infrastructure or applications) as actions that are triggered by events in those git repositories. Manual changes are expressly forbidden by design, and this is achieved by restricting permissions in production environments to service accounts operated by the pipelines only. It’s worth noting that within those git repositories there are still many different options for how to collaborate and construct your workflows. That level of detail is outside our scope here, but if you’re interested in comparing git workflows such as Feature Branch and Forking, a good resource is Atlassian’s documentation here: <a target="_blank" href="https://www.atlassian.com/git/tutorials/comparing-workflows">https://www.atlassian.com/git/tutorials/comparing-workflows</a></p>
<p>Our architecture is separated by two distinct roles: Developer and Operator (these days, Operators might be called Platform Engineers). The Developer is responsible for developing and building applications and maintaining them in the Application Repo. The Operator is responsible for building infrastructure and maintaining this code in the Infrastructure Repo. Of course, depending on the size and layout of your teams, these could also be the same people in a single SRE team, but the important thing is the separation of duties in the respective pipelines. Let’s walk through them!</p>
<ul>
<li><p><strong>The Developer</strong> commits code that represents a new version of an application to a git repo (each application or microservice is likely to have its own repo).</p>
<ul>
<li><p>This commit triggers a CI build pipeline.</p>
</li>
<li><p>The CI system will build a new artifact from the updated code (usually a Docker container).</p>
</li>
<li><p>Tests can and should be run at various stages of this process as well, from tests on the uncompiled code to tests on the final container image. If tests pass successfully, the image can be pushed into an artifact registry.</p>
</li>
<li><p>This in turn triggers the CD system, which deploys the application to a development cluster, or updates an existing deployment in place with the new container.</p>
</li>
<li><p>Using the CD system, Developers or Operators can monitor the state of deployments and choose to promote them through environments – from development to staging, and from staging to production.</p>
</li>
<li><p>Each environment uses its own specific configuration, also retrieved from the git source repository.</p>
</li>
</ul>
</li>
<li><p>Meanwhile, <strong>the Operator</strong> is responsible for building and maintaining the cluster infrastructure itself, rather than the applications that run on it.</p>
<ul>
<li><p>Infrastructure code belongs in its own git repo, and changes to infrastructure are handled through a different CI process.</p>
</li>
<li><p>Normally it's sufficient to use a CI tool such as Cloud Build to manage changes to infrastructure code (like Terraform).</p>
</li>
</ul>
</li>
</ul>
<p>Now we’ve discussed all of the theory, let’s go ahead and put all of this infrastructure in place!</p>
<h2 id="heading-building-a-basic-cicd-architecture-with-cloud-build-and-cloud-deploy">Building a basic CI/CD architecture with Cloud Build and Cloud Deploy</h2>
<p>We'll now walk through building a simple reference architecture that uses a GitOps approach to building an application (or updating an existing build), and then automatically deploying that application to a staging environment, where it can be manually promoted to a production environment. All the sample code below can be found in my repo here: <a target="_blank" href="https://github.com/timhberry/gke-cicd-demo">https://github.com/timhberry/gke-cicd-demo</a></p>
<p>We’re going to concentrate on the Application Repo part of the diagram we looked at earlier. So to follow along you’ll need to have two GKE clusters set up and ready to go. In our code we refer to them as <code>staging-cluster</code> and <code>prod-cluster</code>, and both run in <code>us-central1-c</code>. Keep an eye out for these references in the code if you need to change them for your own environment. You will also need a git repo to host your application code. In the walkthrough we’ll use Github, but Bitbucket is also supported.</p>
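<p>If you don’t have those clusters yet, a couple of basic zonal clusters are enough for this walkthrough. As a rough sketch (cluster sizes and options are entirely up to you):</p>
<pre><code class="lang-bash"># Create two small zonal clusters for the demo (adjust sizes and options to taste)
gcloud container clusters create staging-cluster \
  --zone=us-central1-c --num-nodes=2

gcloud container clusters create prod-cluster \
  --zone=us-central1-c --num-nodes=2
</code></pre>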
<p>Before we get started, let’s look at an overview of the layout we want to build for our application repo. For demonstration purposes, our app will be a simple “Hello World” web server. To provide the CI/CD automation we need, we’ll bundle the following files along with our code in our repo:</p>
<ul>
<li><p><code>cloudbuild.yaml</code>: The configuration for Cloud Build. This will define how the CI pipeline builds our application and triggers the next stage.</p>
</li>
<li><p><code>clouddeploy.yaml</code>: The configuration for Cloud Deploy. This file defines how our CD pipelines should work, managing deployments to our staging and production clusters.</p>
</li>
<li><p><code>skaffold.yaml</code>: The configuration for Skaffold, an automation framework for Kubernetes used by Cloud Deploy (don’t worry, we’ll explain this in a moment!)</p>
</li>
<li><p><code>k8s/deployment.yaml</code> and <code>k8s/service.yaml</code>: Definitions for the Kubernetes objects our application needs.</p>
</li>
<li><p><code>Dockerfile</code>: The container manifest for building our image.</p>
</li>
</ul>
<p>We’ll build up the repo piece by piece in the following sections.</p>
<h3 id="heading-setting-up-permissions-for-cloud-build">Setting up permissions for Cloud Build</h3>
<p>Before we can start using our repo, we need to provide some permissions to the Cloud Build service account. In the following commands, you’ll need your 10-digit project number, as it forms part of the service account name. You can find your project number on the dashboard of the Cloud Console, or by using the following command (replacing <code>&lt;PROJECT-NAME&gt;</code> with your project name):</p>
<pre><code class="lang-bash">gcloud projects describe &lt;PROJECT-NAME&gt; \
    --format=<span class="hljs-string">"value(projectNumber)"</span>
</code></pre>
<p>In the following commands, we’ll use the fake project number <code>5556667778</code>, so just make sure to substitute yours (along with your <code>&lt;PROJECT-NAME&gt;</code>). First, we give Cloud Build permission to create releases in Cloud Deploy:</p>
<pre><code class="lang-bash">gcloud projects add-iam-policy-binding &lt;PROJECT-NAME&gt; \
    --member=serviceAccount:5556667778@cloudbuild.gserviceaccount.com \
    --role=<span class="hljs-string">"roles/clouddeploy.releaser"</span>
</code></pre>
<p>Next, we need to make sure that Cloud Build can act as the default Compute Engine service account. This requires roles to act as a service account and create service account tokens:</p>
<pre><code class="lang-bash">gcloud projects add-iam-policy-binding &lt;PROJECT-NAME&gt; \
  --member serviceAccount:5556667778@cloudbuild.gserviceaccount.com \
  --role roles/iam.serviceAccountTokenCreator

gcloud iam service-accounts add-iam-policy-binding \
     5556667778-compute@developer.gserviceaccount.com \
     --member serviceAccount:5556667778@cloudbuild.gserviceaccount.com \
     --role roles/iam.serviceAccountUser
</code></pre>
<p>Note that this isn’t exactly a best practice – ideally, we would create IAM policy bindings with the least privileges required for our pipelines, and not use the default service account. However, we’re trying to learn Cloud Deploy, and pages and pages of IAM instructions would probably distract from that!</p>
<p>With IAM out of the way, we can now start to set up our repo.</p>
<h3 id="heading-creating-a-sample-app-with-automated-builds">Creating a sample app with automated builds</h3>
<p>To demonstrate our CI/CD pipeline, we’ll build a very simple application that will be easy to update so we can experiment with the update process. All the cool kids are using Golang these days, so I’ve built a small app in Python which you can find in the repo as <code>server.py</code>. In addition, we have a <code>Dockerfile</code> that specifies how to build the container image, a simple HTML <code>template.html</code>, and a <code>requirements.txt</code> file that Python uses to install the libraries it needs.</p>
<p>If you’ve checked out the repo locally, you can test the image on your own machine by building it and running it with Docker (or Podman):</p>
<pre><code class="lang-bash">docker build -t sample-app .
docker run -d -p 8080:8080 sample-app
</code></pre>
<p>When the container is running, try to load <a target="_blank" href="http://localhost:8080">http://localhost:8080</a> in your browser and you should see a “Hello World!” message.</p>
<p>In our pipeline, Cloud Build will be responsible for building our image, tagging it, and storing it in the registry. Let’s create a basic <code>cloudbuild.yaml</code> file to do this for us:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">Build</span> <span class="hljs-string">container</span> <span class="hljs-string">image</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">'gcr.io/cloud-builders/docker'</span>
  <span class="hljs-attr">args:</span> [<span class="hljs-string">'build'</span>, <span class="hljs-string">'-t'</span>, <span class="hljs-string">'gcr.io/$PROJECT_ID/sample-app:$COMMIT_SHA'</span>, <span class="hljs-string">'.'</span>]

<span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">Push</span> <span class="hljs-string">to</span> <span class="hljs-string">Artifact</span> <span class="hljs-string">Registry</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">'gcr.io/cloud-builders/docker'</span>
  <span class="hljs-attr">args:</span> [<span class="hljs-string">'push'</span>, <span class="hljs-string">'gcr.io/$PROJECT_ID/sample-app:$COMMIT_SHA'</span>]
</code></pre>
<p>Once you have this file and your app code committed to your repo, it’s time to set up a Cloud Build trigger. This enables Cloud Build to run automatically whenever a new commit is made to your repo.</p>
<p>By default, Cloud Build will watch the main branch of your repo. In a real-world situation, you may want to build from feature branches or use PRs to merge to the main branch before a build is triggered. This will depend on your chosen Git workflow, and once again, I’m trying to keep this walkthrough as simple as possible!</p>
<p>From the Cloud Build page in the Cloud Console, select <strong>Triggers</strong> and <strong>Create Trigger</strong>. You’ll need to give your trigger a name and then choose a repository. You'll have the option to connect a new repository here, which will forward you to Github to provide permission to connect Cloud Build. Once your repository is selected, we can leave everything else about the trigger at a default and click <strong>Create</strong>.</p>
<p>You should now see your trigger configured, as shown below. Note that the event is “Push to branch”, meaning that a push to the specified branch will trigger Cloud Build. The Build configuration is set to “Auto-detected”, which means that Cloud Build will look for a <code>cloudbuild.yaml</code> file, but if it can’t find one it will also look for a <code>Dockerfile</code> or a <code>buildpacks</code> configuration. A <code>cloudbuild.yaml</code> file will always take priority.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740396873662/403a5432-d721-4ecc-b26e-0f3e6c0167f3.png" alt class="image--center mx-auto" /></p>
<p>You can now go ahead and commit your code to your repo. Remember this should include <code>server.py</code>, <code>Dockerfile</code>, <code>template.html</code>, <code>requirements.txt</code> and <code>cloudbuild.yaml</code>. If you go back to your Cloud Build dashboard, you should now see that a build has been triggered. A short time later, you’ll see that Cloud Build has completed the steps that we defined: it has built your container image and stored it in the registry. You can select the build to see the completed build steps, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740396918319/8dfedf4a-0720-400b-a7f2-b85f0543f335.png" alt class="image--center mx-auto" /></p>
<p>We’ve now achieved the <strong>CI</strong> in <strong>CI/CD</strong>, as we have an automated workflow that continuously integrates changes to our code and automatically generates a new container image every time we do so. Next, we’ll add our Kubernetes objects and configure a pipeline to continuously deploy our changes as well.</p>
<h3 id="heading-adding-kubernetes-manifests">Adding Kubernetes manifests</h3>
<p>Our application will ultimately get deployed to our GKE clusters, so we need to add some Kubernetes objects to our repo. In a sub-directory called <code>k8s/</code> we’ll create a <code>deployment.yaml</code> and <code>service.yaml</code> file. The <code>service.yaml</code> file is very straightforward and simply creates a <code>LoadBalancer</code> service for our app, exposing HTTP port 80:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sample-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">sample-app</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
    <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
    <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<p>The <code>deployment.yaml</code> is also very straightforward at first glance:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">sample-app</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sample-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span> 
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">sample-app</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">sample-app</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">image:</span> <span class="hljs-string">_IMAGE_NAME</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">sample-app</span>
        <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">Always</span>
</code></pre>
<p>Note the container image name however: <code>_IMAGE_NAME</code>. This is a placeholder! In a moment, we’ll update our Cloud Build steps so that the deployment YAML is overwritten with the latest build reference. Add these manifest files to your repo.</p>
<p><strong><em>Let’s keep this simple!</em></strong> <em>There's lots of moving parts here and we’re trying to keep the focus on Cloud Deploy. Overwriting a deployment YAML in our CI process is a very simplistic approach, but there’s nothing wrong with a bit of simplicity! Cloud Deploy also supports complex approaches with tools like Kustomize and Helm, if you’re already using these to build and deploy your applications. For more information see</em> <a target="_blank" href="https://cloud.google.com/deploy/docs/using-skaffold/managing-manifests"><em>https://cloud.google.com/deploy/docs/using-skaffold/managing-manifests</em></a></p>
<h3 id="heading-configuring-cloud-deploy-pipelines-and-releases">Configuring Cloud Deploy pipelines and releases</h3>
<p>Now we can add the configuration required for Cloud Deploy. There are two distinct concepts in Cloud Deploy we need to understand:</p>
<ul>
<li><p>A <strong>pipeline</strong> is the configuration of how our CD process should run. It defines the targets we deploy to, what parameters they should use and other variables such as approval stages.</p>
</li>
<li><p>A <strong>release</strong> specifies a single run through the pipeline CD process that releases artifacts to targets. This is usually triggered by the end of a CI process.</p>
</li>
</ul>
<p>We’ll create the pipeline manually first (although technically you could do this with infrastructure as code too!), and in a moment we’ll rely on Cloud Build to run a command that triggers a release. Here’s our <code>clouddeploy.yaml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">deploy.cloud.google.com/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">DeliveryPipeline</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sample-app</span>
<span class="hljs-attr">description:</span> <span class="hljs-string">sample</span> <span class="hljs-string">app</span> <span class="hljs-string">pipeline</span>
<span class="hljs-attr">serialPipeline:</span>
  <span class="hljs-attr">stages:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">targetId:</span> <span class="hljs-string">staging</span>
    <span class="hljs-attr">profiles:</span> []
  <span class="hljs-bullet">-</span> <span class="hljs-attr">targetId:</span> <span class="hljs-string">prod</span>
    <span class="hljs-attr">profiles:</span> []
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">deploy.cloud.google.com/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Target</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">staging</span>
<span class="hljs-attr">description:</span> <span class="hljs-string">staging</span> <span class="hljs-string">cluster</span>
<span class="hljs-attr">gke:</span>
  <span class="hljs-attr">cluster:</span> <span class="hljs-string">projects/PROJECT_ID/locations/us-central1-c/clusters/staging-cluster</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">deploy.cloud.google.com/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Target</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">prod</span>
<span class="hljs-attr">description:</span> <span class="hljs-string">production</span> <span class="hljs-string">cluster</span>
<span class="hljs-attr">gke:</span>
  <span class="hljs-attr">cluster:</span> <span class="hljs-string">projects/PROJECT_ID/locations/us-central1-c/clusters/prod-cluster</span>
</code></pre>
<p>In this file we define a pipeline with two stages: <strong>staging</strong> and <strong>production</strong>. We then create and associate targets with each stage, and within each target we reference our GKE clusters (make sure you update your zones and references to <code>PROJECT_ID</code>). This is a simple example, but we could expand on it to provide differing values or parameters to be applied to each target environment.</p>
<p>We can now create the pipeline with this command:</p>
<pre><code class="lang-bash">gcloud deploy apply --file=clouddeploy.yaml \
  --region=us-central1
</code></pre>
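<p>If you want to check from the command line that the pipeline and its two targets were registered, a quick describe will do it:</p>
<pre><code class="lang-bash"># Describe the new delivery pipeline and list the registered targets
gcloud deploy delivery-pipelines describe sample-app --region=us-central1
gcloud deploy targets list --region=us-central1
</code></pre>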
<p>In the Cloud Deploy section of the Cloud Console, we should now see an empty delivery pipeline, as shown below. At this stage we’ve configured the pipeline, but nothing has run through it yet. But we’re nearly there!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740397851055/818ef25f-8b46-4a53-95c1-3071233b5c65.png" alt class="image--center mx-auto" /></p>
<p>Under the hood, Cloud Deploy leverages an automation framework called <strong>Skaffold</strong>. This framework manages Kubernetes manifests for Cloud Deploy, rendering them into their required state for deployment. So, we need to add a very simple <code>skaffold.yaml</code> to our repo:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">skaffold/v4beta7</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Config</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sample-app</span>
<span class="hljs-attr">manifests:</span>
  <span class="hljs-attr">rawYaml:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">k8s/*</span>
<span class="hljs-attr">deploy:</span>
  <span class="hljs-attr">kubectl:</span> {}
</code></pre>
<p>This file simply configures Skaffold to look in the <code>k8s/</code> directory in our repo for our manifest files and processes them with <code>kubectl</code>. Once again there are many other things Skaffold can do for you, and it even functions as its own automation system outside of Cloud Deploy, so if you’d like to learn more then check out <a target="_blank" href="https://skaffold.dev/">https://skaffold.dev/</a></p>
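<p>If you have Skaffold installed locally, you can sanity-check this configuration by rendering the manifests yourself, which is roughly what Cloud Deploy does during a release (the exact flags vary a little between Skaffold versions):</p>
<pre><code class="lang-bash"># Render the raw manifests from the k8s/ directory without deploying anything
skaffold render
</code></pre>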
<p>Finally, we’ll add two more steps to our <code>cloudbuild.yaml</code> file, after the existing steps. Here’s the first one:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">Update</span> <span class="hljs-string">deployment.yaml</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">'gcr.io/google.com/cloudsdktool/cloud-sdk'</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">'bash'</span>
  <span class="hljs-attr">args:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">'-c'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">|</span>
    <span class="hljs-string">sed</span> <span class="hljs-string">-i</span> <span class="hljs-string">"s/_IMAGE_NAME/gcr.io\/$PROJECT_ID\/sample-app:$COMMIT_SHA/g"</span> <span class="hljs-string">k8s/deployment.yaml</span>
</code></pre>
<p>This is the step we use to update the placeholder in <code>deployment.yaml</code>. When our application is deployed, we want to make sure we’re using the latest build of our container image, so we use <code>sed</code> to update the file in place with the full image name including the SHA hash from the commit.</p>
<p>The final step we add to <code>cloudbuild.yaml</code> actually triggers a release in Cloud Deploy:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">Trigger</span> <span class="hljs-string">Cloud</span> <span class="hljs-string">Deploy</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">'gcr.io/google.com/cloudsdktool/cloud-sdk'</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">'bash'</span>
  <span class="hljs-attr">args:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">'-c'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">|
    SHORT_SHA=$(echo $COMMIT_SHA | cut -c1-7)
    gcloud deploy releases create "release-$SHORT_SHA" \
      --delivery-pipeline=sample-app \
      --region=us-central1 \
      --images=sample-app=gcr.io/$PROJECT_ID/sample-app:$COMMIT_SHA</span>
</code></pre>
<p>Note that we reference the name of the pipeline we created earlier with <code>--delivery-pipeline</code>. We also provide a name for the release, which we construct by creating a shorter version of the commit SHA.</p>
<p>With all of these pieces now in place, push your changes to your repo and the following magic will happen:</p>
<ul>
<li><p>Your Cloud Build pipeline will be triggered by your git repo changes</p>
</li>
<li><p>Cloud Build will build your application, tag it and push it to the registry</p>
</li>
<li><p>Cloud Build will also update your <code>deployment.yaml</code> with the new container image tag</p>
</li>
<li><p>It will then trigger a release in your Cloud Deploy pipeline</p>
</li>
<li><p>The Cloud Deploy pipeline will deploy your new application and the Kubernetes manifests to your staging cluster</p>
</li>
</ul>
<h3 id="heading-observing-the-cicd-pipeline">Observing the CI/CD pipeline</h3>
<p>First, your Cloud Build pipeline will run, only now it will contain the extra stages you defined. When it is complete, you should see that Cloud Deploy has been triggered, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740398073317/a9a07c6c-f949-450e-a207-a9835179d87c.png" alt class="image--center mx-auto" /></p>
<p>Now if you hop over to the Cloud Deploy dashboard in the Cloud Console, you should see that your pipeline has a new release. A successful release also creates a <strong>roll-out</strong>, which is simply when the desired artifacts get deployed to their target. In other words, your application has now been deployed to your staging cluster!</p>
<p>You can see this from the pipeline view in Cloud Deploy, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740398105696/684e293e-856e-48ff-aae8-fda7688664df.png" alt class="image--center mx-auto" /></p>
<p>From the Kubernetes section in the Cloud Console, you can also look at the deployed workloads and see that your sample app is running on the staging cluster:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740398154406/3e30933b-8644-47ce-9dff-7442b5e412c2.png" alt class="image--center mx-auto" /></p>
<p>In the details of the deployment, you should be able to locate the external IP of the LoadBalancer service and load that in your browser. You should be greeted with the friendly blue page of a very boring Hello World web app:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740398168219/718e0f81-74bd-4f21-9302-66edbea05cce.png" alt class="image--center mx-auto" /></p>
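<p>Alternatively, you can grab the external IP with <code>kubectl</code>, assuming your current context points at the staging cluster and the app landed in the default namespace:</p>
<pre><code class="lang-bash"># Find the external IP assigned to the LoadBalancer service
kubectl get service sample-app --namespace=default
</code></pre>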
<p>In our sample developer workflow, tests could now be run on the latest version of the application in our staging environment. Once we are happy with the updated application, we can promote it to production directly from Cloud Deploy.</p>
<p>From the Cloud Deploy pipeline dashboard, we simply select <strong>Promote</strong>. This prompts us to confirm the promotion target (which in this case is our production cluster) and provides us with a summary of the actions that the system will take once we start the promotion. Go ahead and click <strong>Promote</strong> again and you can watch what happens:</p>
<ul>
<li><p>The current release creates a new <strong>roll-out</strong>, this time targeting the production cluster.</p>
</li>
<li><p>The application artifact and Kubernetes manifests are rendered to this new target.</p>
</li>
</ul>
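<p>By the way, promotion doesn’t have to happen in the Console. The same action is available from the command line, substituting the name of your own release (the <code>release-abc1234</code> below is just a placeholder):</p>
<pre><code class="lang-bash"># Promote an existing release to the next target in the pipeline
gcloud deploy releases promote \
  --release=release-abc1234 \
  --delivery-pipeline=sample-app \
  --region=us-central1
</code></pre>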
<p>You’ll now have 2 workloads running – your sample app in each cluster:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740398222813/28cbf7f5-541d-4823-9ed5-4b735e6a7c41.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-updating-apps-through-the-cicd-workflow">Updating apps through the CI/CD workflow</h3>
<p>Now we’ve finished the initial roll-out of the application, you can simulate the process for changes as well. Everything is now in place so that next time code is pushed to the main branch of the repo, the CI/CD process runs again. The new version of the app will be built and automatically pushed to the staging cluster, ready for you to test and approve for promotion to production.</p>
<p>You can try this for yourself by making a simple change – I would suggest altering the background colour of the app to something more appealing! You can find the <code>bgcolor</code> variable defined in <code>server.py</code>.</p>
<p>In this walkthrough, we’ve built a reference CI/CD architecture for Cloud Build and Cloud Deploy. But of course, we’ve only just scratched the surface of what these tools can do. For example:</p>
<ul>
<li><p>Cloud Deploy can automatically promote rollouts for you, based on specific rules and targets.</p>
</li>
<li><p>Cloud Deploy also supports complex deployment patterns like canary testing and deploying to multiple targets at once.</p>
</li>
<li><p>Skaffold configuration can further customise each target, for example by defining a specific number of <code>Pod</code> replicas per target.</p>
</li>
</ul>
<p>Now you understand the basics of these systems, you may be ready for some more advanced approaches. An even more in-depth tutorial that includes frameworks for different programming languages can be found at <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/tutorials/modern-cicd-gke-reference-architecture">https://cloud.google.com/kubernetes-engine/docs/tutorials/modern-cicd-gke-reference-architecture</a></p>
<h2 id="heading-some-considerations-for-private-build-infrastructure">Some considerations for private build infrastructure</h2>
<p>Cloud Build and Cloud Deploy are fantastic tools for automating the various build and deploy stages of your CI and CD pipelines. Each step of your Cloud Build pipeline runs as an isolated and containerised process with access to a shared workspace for the pipeline's duration. Under the hood, when Cloud Deploy executes steps in your CD pipelines it also uses workers in Cloud Build.</p>
<p>However, by default these workers are running in a shared pool as part of their respective services. In many cases this might not be desirable: you may have regulatory requirements that mean all compute processes have to stay within certain network boundaries, or you may have components that only accept private connections within your VPC that are required for steps in your pipelines.</p>
<p>Thankfully, Cloud Build allows you to configure <strong>private pools</strong>, which are pools of worker resources that run within a dedicated private VPC. This service producer VPC is still Google-managed and does not exist within your project, however you can peer with it directly for a private connection that only requires internal IP addresses.</p>
<p>Alternatively, if you don’t want to configure VPC peering, you can use a dedicated private pool but allow it to use public endpoints to communicate with resources in your VPC. You might choose this option if your VPC already has public endpoints, and you just want a private pool to give you more options over the configuration of the Cloud Build workers.</p>
<p>You can create a private pool using the following command. There are lots of parameter placeholders here, which I’ll explain in a moment:</p>
<pre><code class="lang-bash">gcloud builds worker-pools create &lt;PRIVATEPOOL_ID&gt; \
    --project=&lt;PROJECT_ID&gt; \
    --region=&lt;REGION&gt; \
    --peered-network=&lt;PEERED_NETWORK&gt; \
    --worker-machine-type=&lt;MACHINE_TYPE&gt; \
    --worker-disk-size=&lt;DISK_SIZE&gt; \
    --no-public-egress
</code></pre>
<p>Let’s walk through those parameters:</p>
<ul>
<li><p><code>PRIVATEPOOL_ID</code> is the unique name that you assign to your private pool.</p>
</li>
<li><p><code>PROJECT_ID</code> is the name of the project that you wish to use with this pool.</p>
</li>
<li><p><code>REGION</code> is your chosen region where the private pool should run.</p>
</li>
<li><p><code>PEERED_NETWORK</code> is the VPC network in your project that should be peered with Google’s private service producer network. Note that you’ll need to specify the long resource name for this parameter, for example: <code>projects/my-project/global/networks/my-vpc</code>.</p>
</li>
<li><p><code>MACHINE_TYPE</code> allows you to choose the machine type for your workers. Several Compute Engine machine types are supported, and if you don’t specify a preference, an <code>e2-standard-2</code> will be used. Note that this parameter can also be overridden at runtime by passing the <code>--machine-type</code> option to <code>gcloud builds submit</code>.</p>
</li>
<li><p><code>DISK_SIZE</code> lets you adjust the disk size of the worker instances, which must be between 100 and 4000 GB. Once again this is optional; Cloud Build will default to a size of 100GB. This can also be overridden at runtime with the <code>--disk-size</code> option.</p>
</li>
<li><p>The <code>--no-public-egress</code> option ensures that workers are created without an external IP address.</p>
</li>
</ul>
<p>Note that these parameters can also be passed in a configuration file which defines a private pool schema. The layout of this schema is documented at <a target="_blank" href="https://cloud.google.com/build/docs/private-pools/private-pool-config-file-schema">https://cloud.google.com/build/docs/private-pools/private-pool-config-file-schema</a></p>
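<p>As a rough sketch of that approach (the schema page linked above is the authoritative reference, so treat these field names as an approximation), a config file equivalent to the command above might look something like this:</p>
<pre><code class="lang-yaml"># Approximate private pool config file; check the schema documentation for exact field names
privatePoolV1Config:
  networkConfig:
    peeredNetwork: projects/my-project/global/networks/my-vpc
    egressOption: NO_PUBLIC_EGRESS
  workerConfig:
    machineType: e2-standard-2
    diskSizeGb: 100
</code></pre>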
<p>Once the private pool has been created, you can choose to use it any time Cloud Build executes a pipeline. If you’re submitting a build job manually, you can specify the name of the worker pool on the command line. For example:</p>
<pre><code class="lang-bash">gcloud builds submit --config=cloudbuild.yaml \
  --worker-pool=projects/my-project/locations/us-central1/workerPools/my-privatepool
</code></pre>
<p>You can also specify the private pool directly in your <code>cloudbuild.yaml</code>, which helps if your builds are being triggered by a git commit. Simply add the pool configuration to an <code>options</code> section at the bottom of your <code>cloudbuild.yaml</code> file like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">'bash'</span>
  <span class="hljs-attr">args:</span> [<span class="hljs-string">'echo'</span>, <span class="hljs-string">'Previous steps up to this point'</span>]
<span class="hljs-attr">options:</span>
  <span class="hljs-attr">pool:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">'projects/my-project/locations/us-central1/workerPools/my-privatepool'</span>
</code></pre>
<p>As we mentioned earlier, all Cloud Deploy steps also run within Cloud Build workers, so we can also specify a Cloud Build private pool to be used in our Cloud Deploy pipelines. Cloud Build is used for rendering manifests (using Skaffold) as well as deploying to clusters (using <code>kubectl</code>) and we can specify which worker pools to use as well as for which stages.</p>
<p>We do this by updating <code>clouddeploy.yaml</code> and adding an <code>executionConfigs</code> section to the object definition for each Target (i.e. each cluster). Here’s an example using the production cluster from our earlier walkthrough:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">deploy.cloud.google.com/v1</span> 
<span class="hljs-attr">kind:</span> <span class="hljs-string">Target</span> 
<span class="hljs-attr">metadata:</span> 
  <span class="hljs-attr">name:</span> <span class="hljs-string">prod</span> 
<span class="hljs-attr">description:</span> <span class="hljs-string">production</span> <span class="hljs-string">cluster</span> 
<span class="hljs-attr">gke:</span> 
  <span class="hljs-attr">cluster:</span> <span class="hljs-string">projects/PROJECT_ID/locations/us-central1-c/clusters/prod-cluster</span>
<span class="hljs-attr">executionConfigs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">privatePool:</span>
      <span class="hljs-attr">workerPool:</span> <span class="hljs-string">"projects/my-project/locations/us-central1/workerPools/my-privatepool"</span>
    <span class="hljs-attr">usages:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">RENDER</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">DEPLOY</span>
</code></pre>
<p>Using private pool configurations like this means that you can still leverage Cloud Build and Cloud Deploy, even if you have completely private infrastructure. This is helpful if you’re using hybrid connections between your on-premises networks or other clouds via your VPC, or if you’re just using private GKE clusters.</p>
<h2 id="heading-summary">Summary</h2>
<p>In this post I’ve hopefully established a compelling argument for using automation tools and a CI/CD architecture for your software development lifecycle. As I stated earlier, while these tools and practices can enable rapid iteration and development, that’s normally not the goal in and of itself. In order to support fast but safe deployments, all of these additional mechanisms work together to make your deployments more reliable. This means you’ll have much more confidence changing the state of your workloads or fixing them if something goes wrong.</p>
<p><strong>And that’s a wrap! (for now…)</strong></p>
<p>Ten posts on a single subject seems like a good milestone! I’ve found it fascinating digging deep into the world of GKE Enterprise, and I hope through these posts I’ve made it a bit more accessible for anyone out there trying to learn more about these topics.</p>
<p>Stay tuned to my blog for more general Kubernetes and cloud writing, and maybe some other completely different stuff too! 😄</p>
]]></content:encoded></item><item><title><![CDATA[Application Security in GKE Enterprise]]></title><description><![CDATA[This is the ninth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to supp...]]></description><link>https://timberry.dev/application-security-in-gke-enterprise</link><guid isPermaLink="true">https://timberry.dev/application-security-in-gke-enterprise</guid><category><![CDATA[binaryauth]]></category><category><![CDATA[cloudarmor]]></category><category><![CDATA[identityawareproxy]]></category><category><![CDATA[gke]]></category><category><![CDATA[Security]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[kubernetes network policies]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Thu, 06 Feb 2025 11:28:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737732221015/1d427cc4-8bb4-43f9-98ae-87775e21ef38.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the ninth post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>Regardless of where our clusters run or how our applications are configured, security must be a fundamental consideration in any design. The power and flexibility of Kubernetes can bring an increased level of complexity, enlarging attack surfaces and making security considerations all the more important. In <a target="_blank" href="https://timberry.dev/securing-gke-workloads-with-service-mesh?source=more_series_bottom_blogs">my last post</a>, I explored how Cloud Service Mesh can help us to identify our workloads and then encrypt and control their communication. Even though we’ve completed our Service Mesh journey in this series, we’ll now continue the topic of security in this post and learn some of the most useful Kubernetes features that can help us keep our workloads secure. Heads up that some of these are specific to GKE, and some are just generally available in Kubernetes.</p>
<p>The one thing we cannot afford to do is be complacent about security! A long time ago people used to make terrible mistakes like assuming an application was secure if it was “behind the firewall”, but as we know, the Internet is a zero-trust environment and should be treated as such. The Cloud Native Computing Foundation (CNCF) summarises this with its “4 Cs” that we should care about when it comes to security: Code, Container, Cluster and Cloud. It’s our job to secure each of these things to the best of our ability!</p>
<p>To achieve this, we’re going to learn the following things in this post:</p>
<ul>
<li><p>Leveraging Workload Identity for your apps</p>
</li>
<li><p>Trusting container deployments with Binary Authorization</p>
</li>
<li><p>Integrating other Google Cloud security tools</p>
</li>
<li><p>Securing Traffic with Kubernetes Network Policies</p>
</li>
</ul>
<p>Of course, we can’t cover every Kubernetes security topic in this post! We’re going to investigate some of these specialised areas, but you should already have a solid grounding in Kubernetes security concepts such as Role-Based Access Control (RBAC) and namespaces. If you need a quick brush up, I’d recommend reviewing the documentation here: <a target="_blank" href="https://kubernetes.io/docs/concepts/security/">https://kubernetes.io/docs/concepts/security/</a></p>
<h2 id="heading-leveraging-workload-identity-for-your-apps">Leveraging Workload Identity for your apps</h2>
<p>I’ve mentioned Workload Identity a few times already in this series of posts, and chances are that you have it enabled by default in your fleet settings. Like most Google products, Workload Identity has a self-explanatory name, and provides your workloads with an identity. But why is that important?</p>
<p>One of the most fundamental security concepts you should embed in your system designs is the <strong>principle of least privilege</strong>. Simply put: this principle states that any component of a system should be granted <em>only</em> the permissions it requires to perform its function, and nothing greater. Okay, but why is <em>that</em> important?</p>
<p>You can consider any component in your system as a potential attack surface, vulnerable to being compromised and used as a jumping off point into other parts of your system. The permissions that a single component is granted will determine what other parts of your system it can access and to what extent (for example, read only or read and write). By reducing the permissions granted to each component you are therefore reducing the size of that attack surface.</p>
<p>Historically, when Google’s Cloud IAM was in its infancy, the principle of least privilege was difficult to achieve. All Google Cloud services run with a service account as an identity, and it was common for compute services (including GKE nodes) to run with the default Compute Engine service account. Some early Google Cloud engineer, in their naivety, figured that this service account should have the “Project Editor” role. The upshot was that any compromised service suddenly had complete access to everything in a project – all the other services, all the data, all the APIs.</p>
<p>Fast forward to today when we have an extremely well structured and powerful Cloud IAM service, and a set of carefully crafted predefined roles. Now when we need a workload to access something, we can define an IAM binding that grants a specific set of permissions only, therefore achieving the principle of least privilege. All we have to do is identify ourselves as the service account to which the IAM roles have been bound.</p>
<p>This used to mean downloading and managing service account keys, but this created an additional attack surface! If you accidentally left your keys lying around, someone could use them to essentially steal the identity of a service account and access any systems for which it had permissions. However, recently all of the compute platforms in Google Cloud have been improved so that manual keys are no longer required. Instead, the platform itself will manage a short-lived token to provide access to a service account.</p>
<h3 id="heading-workload-identity-federation">Workload Identity Federation</h3>
<p>This capability has now been extended to <strong>Kubernetes service accounts</strong> within GKE (note that these are Kubernetes objects inside your cluster, <em>not</em> Cloud IAM service accounts – although the two can be used together, more on that in a moment!)</p>
<p>When you run a <code>Pod</code> workload with a Kubernetes service account identity, you can now create an IAM binding for that identity to specific IAM roles and permissions. For example, if your <code>Pod</code> workload needs to access the Cloud Storage API to write to a specific bucket, you can grant the specific granular permission to allow this with an IAM binding, rather than using the default service account of the cluster node.</p>
<p>To do this, you reference the Kubernetes service account as a principal in the membership of the IAM policy binding. The membership identifier is a bit cumbersome however, as it contains your project number, the name of your workload identity pool, a namespace and the name of the Kubernetes service account.</p>
<p>Here’s an example where the project number is <code>123456123456</code> and the project name is <code>my-project</code>, which makes the workload identity pool <code>my-project.svc.id.goog</code>. Assuming we have a namespace called <code>frontend</code> and a Kubernetes service account called <code>fe-web-sa</code>, this would make the identifier:</p>
<pre><code class="lang-plaintext">principal://iam.googleapis.com/projects/123456123456/locations/global/workloadIdentityPools/my-project.svc.id.goog/subject/ns/frontend/sa/fe-web-sa
</code></pre>
<p>Federated identity works in all GKE clusters, but clusters in fleets have an additional advantage. Rather than an identity pool spanning a single cluster, all clusters attached to a fleet share the same fleet-wide pool. This means that you can create an IAM binding just once, and it will be applied to all clusters in the fleet where the specified namespace and service account name match (bonus points if you remember the concept of <em>sameness</em> that I talked about very early on in this series!)</p>
<p>Let’s walk through an example of creating a workload that accesses the Cloud Storage API to demonstrate some best practices. We’ll assume we have a storage bucket that contains some files, and a GKE cluster with Workload Identity enabled. In this example, we’ll add the IAM policy binding at the level of an individual bucket, not at the project level.</p>
<p>Following the examples we used earlier to explain the IAM principal membership, let’s go ahead and create a namespace called <code>frontend</code> and a Kubernetes service account called <code>fe-web-sa</code>:</p>
<pre><code class="lang-bash">kubectl create namespace frontend
kubectl -n frontend create serviceaccount fe-web-sa
</code></pre>
<p>Now that we’ve created a Kubernetes service account, we can simply reference it as the member (or principal) in an IAM policy binding. In this example, we’ll grant the <strong>Storage Object User</strong> role to a bucket called <code>frontend-files</code>. This predefined IAM role grants access to create, view, list, update and delete objects and their metadata, but it doesn’t grant the user any further permissions to manage ACLs or IAM policies. This is an example of <em>just the right level of privilege</em> for a use-case where a workload needs read-write access to objects in a bucket, without being able to change the bucket itself.</p>
<p>We’ll create the binding with this command:</p>
<pre><code class="lang-bash">gcloud storage buckets add-iam-policy-binding gs://frontend-files \
  --member=principal://iam.googleapis.com/projects/123456123456/locations/global/workloadIdentityPools/my-project.svc.id.goog/subject/ns/frontend/sa/fe-web-sa \
  --role=roles/storage.objectUser
</code></pre>
<p>Now, let’s create a workload that actually uses these permissions. In a real-world use case, we can perhaps imagine a front-end web server <code>Pod</code> that grabs its files from Cloud Storage on startup. For testing purposes, we’ll just spin up a <code>Pod</code> that contains the <code>gcloud</code> tool so we can test if our permissions work. We’ll define the <code>Pod</code> as follows:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">test-pod</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">serviceAccountName:</span> <span class="hljs-string">fe-web-sa</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">test-pod</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">google/cloud-sdk:slim</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">"sleep"</span>,<span class="hljs-string">"infinity"</span>]
</code></pre>
<p>Once the container is up and running, we can test permissions by running some <code>gcloud</code> commands that should only be possible with the correct IAM binding. For example, we could list the contents of the bucket:</p>
<pre><code class="lang-bash">kubectl -n frontend <span class="hljs-built_in">exec</span> -it test-pod -- gcloud storage ls gs://frontend-files
</code></pre>
<p>Or try to create a new file by copying one that already exists inside the bucket:</p>
<pre><code class="lang-bash">kubectl -n frontend <span class="hljs-built_in">exec</span> -it test-pod -- gcloud storage cp gs://frontend-files/test1.txt gs://frontend-files/test2.txt
</code></pre>
<p>Of course, this isn’t the sort of thing your workloads are likely to be doing, but the key thing is that our <code>Pod</code> uses its service identity to get the permissions it needs, and <em>only</em> those permissions. Thanks to identity federation, we were able to treat our Kubernetes service account as a principal in an IAM binding just like we would with a Cloud IAM service account.</p>
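<p>If you want to prove the <em>boundary</em> of those permissions too, try an operation that the <strong>Storage Object User</strong> role doesn’t include. As a sketch (assuming no other bindings grant broader access), describing the bucket itself requires the <code>storage.buckets.get</code> permission, so a command like this should be rejected with a permission error:</p>
<pre><code class="lang-bash">kubectl -n frontend exec -it test-pod -- gcloud storage buckets describe gs://frontend-files
</code></pre>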
<p>However, at the time of writing there were some limitations with specific APIs and the extent to which federated identities could be used with them. You can find these details documented here: <a target="_blank" href="https://cloud.google.com/iam/docs/federated-identity-supported-services">https://cloud.google.com/iam/docs/federated-identity-supported-services</a></p>
<h3 id="heading-identity-with-unsupported-apis">Identity with unsupported APIs</h3>
<p>If you need to control access to a specific Google Cloud API in a way that is not supported by federated identity, you can still fall back to a legacy method that achieves the same thing. In this method, we create a Cloud IAM service account with the correct IAM bindings and permissions, then we allow the Kubernetes service account to <em>impersonate</em> the IAM service account.</p>
<p>There are a few more moving parts to this process. Assuming we have already created our Kubernetes service account <code>fe-web-sa</code> and our workload that uses that service account, we now need to create a matching Cloud IAM service account, which in this case we’ll call <code>iam-fe-web-sa</code>:</p>
<pre><code class="lang-bash">gcloud iam service-accounts create iam-fe-web-sa
</code></pre>
<p>In this scenario, we add the policy bindings to the Cloud IAM service account. So, following our previous example, we’ll grant it the <strong>Storage Object User</strong> role on the same bucket as before. Remember that Cloud IAM service accounts are identified by an email address in the format of <code>&lt;service-account-name&gt;@&lt;project-name&gt;.iam.gserviceaccount.com</code> which makes our command:</p>
<pre><code class="lang-bash">gcloud storage buckets add-iam-policy-binding gs://frontend-files \
  --member=serviceAccount:iam-fe-web-sa@my-project.iam.gserviceaccount.com \
  --role=roles/storage.objectUser
</code></pre>
<p>Next, we need to create an IAM policy that allows the Kubernetes service account to impersonate the IAM service account:</p>
<pre><code class="lang-bash">gcloud iam service-accounts add-iam-policy-binding iam-fe-web-sa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member <span class="hljs-string">"serviceAccount:my-project.svc.id.goog[frontend/fe-web-sa]"</span>
</code></pre>
<p>And finally, we annotate the Kubernetes service account so that GKE understands there is a link between the two service accounts:</p>
<pre><code class="lang-bash">kubectl annotate serviceaccount fe-web-sa \
  --namespace frontend \
  iam.gke.io/gcp-service-account=iam-fe-web-sa@my-project.iam.gserviceaccount.com
</code></pre>
<p>This essentially triggers the same process as identity federation and allows the cluster to obtain and use short-lived credentials automatically. These approaches may seem like extra effort, but once again: it cannot be stressed enough how important least privilege is as a security principle. In the unfortunate event of an attacker gaining access to a component of your system, these practices drastically reduce the further damage they can do.</p>
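<p>Before moving on, a quick sanity check: you can confirm the annotation was applied by inspecting the Kubernetes service account and looking for the <code>iam.gke.io/gcp-service-account</code> annotation in the output:</p>
<pre><code class="lang-bash">kubectl -n frontend get serviceaccount fe-web-sa -o yaml
</code></pre>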
<p>So, we’ve covered regulating the API access that our workloads have when running in our clusters, but what about regulating what workloads will run in the first place?</p>
<h2 id="heading-trusting-container-deployments-with-binary-authorization">Trusting container deployments with Binary Authorization</h2>
<p>If you recall the 4 Cs we discussed earlier, the first and most important is “Code”. We can put all kinds of structural security in place, but we’re still trusting the code that runs inside our containers. It’s helpful then to have a way to guarantee that only trusted container images are allowed to run, and all other container images are blocked. This is the purpose of Binary Authorization.</p>
<p>Binary Authorization works with two main concepts: <em>attestations</em> and <em>policies:</em></p>
<ul>
<li><p>An <strong>attestation</strong> is essentially an electronic signature, attached to a container image, that confirms that the image has been signed by an attestor. An attestor can run at any stage in your container image build pipeline to provide important safeguards in the build process. For example, you may want to leverage Google’s Artifact Analysis service to scan the metadata of your images and check their vulnerability severity. Or you can use third-party scanning services like Kritis Signer or Voucher or use your own services to check containers as they go through the build process. The important thing is that at each stage, if the attestor is happy that the container is compliant, it creates a signed attestation.</p>
</li>
<li><p>A Binary Authorization <strong>policy</strong> then determines which attestations are required for a container image to be deployed. Optional rules can be added to a policy to control which clusters and service identities can deploy an image, and these can be scoped to a namespace if you wish. You can also choose if policies should just evaluate or enforce the rules you have set.</p>
</li>
</ul>
<p>A typical CI/CD pipeline that embeds Binary Authorization would leverage an attestation at each environment stage, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738340672641/13636087-f261-40ef-a403-2bb0980e8207.png" alt class="image--center mx-auto" /></p>
<p>Here we can see the usual process of building and pushing an image on a commit. The build pipeline then triggers the vulnerability scan, and we determine if the output of the scan deems the image to be safe or not. For example, an image that contains no CVEs with a severity score greater than five could possibly be considered safe (but your mileage may vary!) At this point, a safe image can generate an attestation that can be checked later.</p>
<p>The pipeline can go on to deploy to a staging environment, and then run quality assurance tests to make sure the updated deployment works as expected. These tests could be automated or manually run by a member of a QA team, and if they are passed, we generate another attestation to say so.</p>
<p>Finally, when the image is due to be deployed to the production cluster, a Binary Authorization policy can ensure that both positive attestations are in place, and check that they have been signed by the relevant attestors. If they are not, the deployment will be denied, protecting the production environment.</p>
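<p>To give you a flavour of what this looks like, here’s a minimal sketch of a project-level Binary Authorization policy that requires a single attestation before any image can run (the project and attestor names are just placeholders, and a real pipeline would typically require one attestor per stage). You would import a file like this with <code>gcloud container binauthz policy import policy.yaml</code>:</p>
<pre><code class="lang-yaml"># policy.yaml - a minimal sketch; resource names are hypothetical
globalPolicyEvaluationMode: ENABLE
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
  - projects/my-project/attestors/vulnz-attestor
</code></pre>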
<p>Unfortunately, configuring Binary Authorization in a CI/CD pipeline is a complex process which would warrant an entire blog post series of its own, so I can’t cover it here! If you want to try this out for yourself, Google’s own guide is recommended: <a target="_blank" href="https://cloud.google.com/binary-authorization/docs/cloud-build">https://cloud.google.com/binary-authorization/docs/cloud-build</a></p>
<p>Let’s jump ahead to the final C in the 4 we discussed at the start of this post. Even if our code, container and cluster are air-tight, surely there are some platform tools in Google Cloud that can help us?</p>
<h2 id="heading-integrating-other-google-cloud-security-tools">Integrating other Google Cloud security tools</h2>
<p>You may already be familiar with some of the tools available in Google Cloud that can generally help with platform security. In this section we’ll discuss two of them specifically – <strong>Cloud Armor</strong> and <strong>Cloud IAP</strong> – and how they can be integrated into the GKE Enterprise configurations we’ve been discussing so far. Let’s start with Cloud Armor!</p>
<h3 id="heading-cloud-armor">Cloud Armor</h3>
<p>Google’s Cloud Armor is essentially a Web Application Firewall (WAF), applied as a set of configuration polices which combine with Google’s different load balancing options to filter incoming Layer 7 network traffic and protect your workloads and backends. Cloud Armor policies can comprise multiple rules, and there are several types of rules you can use:</p>
<ul>
<li><p>IP <code>allowlist</code> and <code>denylist</code> rules can filter traffic based on IP or CIDR, and can return a variety of HTTP response codes</p>
</li>
<li><p>Source geography rules can exclude traffic from specific geolocations</p>
</li>
<li><p>Preconfigured WAF rules, based on the Open Worldwide Application Security Project (OWASP) Core Rule Set (CRS), can protect your workloads from common attacks such as SQL injection, cross-site scripting, PHP injection and many more.</p>
</li>
<li><p>Bot Management rules help you manage requests that may be coming from automated clients or bots. For example, you can force requests to identify themselves through a reCAPTCHA challenge.</p>
</li>
<li><p>Rate limiting rules can throttle requests or temporarily ban clients that exceed a predefined rate threshold.</p>
</li>
</ul>
<p>You can also write custom rules using Google’s Common Expression Language (CEL). Cloud Armor works with any Google Cloud Load Balancer and is not limited to backends on GKE. However, if you’ve configured your cluster ingress using the GKE Gateway Controller you can easily associate a Cloud Armor policy with your Gateway to protect the services it exposes.</p>
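<p>If you haven’t created a Cloud Armor policy before, the gcloud workflow looks roughly like this (the policy name and rule below are purely illustrative, not a definitive setup): create an empty policy, then add rules to it in priority order.</p>
<pre><code class="lang-bash"># Create an empty Cloud Armor security policy (name is illustrative)
gcloud compute security-policies create web-security-policy \
  --description "WAF policy for the store frontend"

# Add a preconfigured OWASP rule that blocks SQL injection attempts
gcloud compute security-policies rules create 1000 \
  --security-policy web-security-policy \
  --expression "evaluatePreconfiguredExpr('sqli-stable')" \
  --action deny-403
</code></pre>
<p>With a policy in place, the next step is to attach it to the services behind your Gateway.</p>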
<p>We do this by creating a <code>GCPBackendPolicy</code> object. This object contains details of the additional functionality that should be added to the load balancer in its spec, such as referencing a Cloud Armor policy. The <code>GCPBackendPolicy</code> then targets a specific <code>Service</code>, or in the case of a multi-cluster service, a <code>ServiceImport</code>.</p>
<p>Here’s an example where we have already created a Cloud Armor policy called <code>web-security-policy</code>, and we want to use it to protect a backend <code>Service</code> called <strong>store</strong>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.gke.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GCPBackendPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">webstore-backend-policy</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">default:</span>
    <span class="hljs-attr">securityPolicy:</span> <span class="hljs-string">web-security-policy</span>
  <span class="hljs-attr">targetRef:</span>
    <span class="hljs-attr">group:</span> <span class="hljs-string">""</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">store</span>
</code></pre>
<p>Note that the <code>GCPBackendPolicy</code> must exist in the same namespace as the Gateway that it is attached to, and only a single <code>GCPBackendPolicy</code> object may be used per service. The <code>targetRef</code> points to our <strong>store</strong> <code>Service</code>; the <code>group</code> parameter is blank because <code>Services</code> belong to the core API group. If we had set up <strong>store</strong> as a multi-cluster service, we would simply change the <code>targetRef</code> section to refer to the <code>ServiceImport</code> object instead, like this:</p>
<pre><code class="lang-yaml">  targetRef:
    group: net.gke.io
    kind: ServiceImport
    name: store
</code></pre>
<p>At the time of writing, some limited aspects of the <code>GCPBackendPolicy</code> could be applied to the <code>Gateway</code> object itself, not just the service, which could potentially save you a lot of time if you want to apply the same security policy to lots of different services. It’s worth checking the documentation at <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources">https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources</a> to see if this support has been expanded.</p>
<h3 id="heading-identity-aware-proxy">Identity Aware Proxy</h3>
<p>Google’s Identity Aware Proxy (IAP) is an additional layer of protection that can be enabled for applications that are exposed via load balancers, in use cases where you only want to grant access to your own organization’s users. As part of Google’s “BeyondCorp” methodology, it provides a simple way to prove a user’s identity rather than having them connect to an internal application using a VPN.</p>
<p>IAP works by intercepting requests that travel through a load balancer and redirecting users to the Google sign-in page. Here they must successfully authenticate themselves with Google’s identity services. This completes the <em>authentication</em> stage of IAP.</p>
<p>Next comes <em>authorization</em>. For a user’s request to be passed to the backend service, IAP checks that the correct IAM policy is in place to allow this. A user requires the <strong>IAP-Secured Web App User</strong> role assigned to them in the same Google Cloud project as the backend resource.</p>
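<p>Granting that role is a standard IAM binding. As a sketch (the project ID and user email are placeholders), assigning the <strong>IAP-Secured Web App User</strong> role (<code>roles/iap.httpsResourceAccessor</code>) might look like this:</p>
<pre><code class="lang-bash">gcloud projects add-iam-policy-binding my-project \
  --member "user:alice@example.com" \
  --role roles/iap.httpsResourceAccessor
</code></pre>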
<p>The benefit of this functionality is that IAP is easy to drop in to provide a highly secure method of authentication and authorization without having to modify any of your backend workloads. The downside is that it only works for users who actually have accounts within your Google organization (i.e. your Google domain), because you’ll need to be able to assign IAM roles to them within your project. Just in case this scenario is relevant to you, let’s discuss how we can add IAP to one of our GKE backend services.</p>
<p><strong>Configuring the Consent Screen</strong></p>
<p>The Consent Screen is the user interaction that appears when a user is asked to sign into their Google account to access your service, as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738341700978/867a9b98-3841-4edb-9ece-4a918dfdf2a6.png" alt class="image--center mx-auto" /></p>
<p>We’ve all seen this a hundred times; we’re logging in with our Google credentials, but a third-party application wants to know who we are. When you enable IAP for your Google Cloud project, you’ll need to configure a consent screen with the name of your application, some contact details and optionally an image for a logo. Then you’ll need to decide what scopes your consent screen is asking for.</p>
<p>The scopes in your consent screen determine what information you want Google to pass on to your application if authentication is successful. For example, if you’re trying to access a third-party application that needs to write files to your Google Drive, you would be presented with this information at the consent screen. An application can’t access scopes unless they are specifically authorized by the end-user in this way. It might be that all you need is a user’s email address to identify them in your own data, but there are dozens of other scopes you can include if they are necessary.</p>
<p>Once you’ve completed the Consent Screen setup, IAP is enabled on your project, but there’s still one more step to take. You’ll need to generate an OAuth 2.0 Client ID so that your IAP integration has access to the correct APIs. You can do this from the <strong>Credentials</strong> screen in the <strong>APIs</strong> section of the Cloud Console. Once you’ve created the OAuth 2.0 Client ID, download the credentials file and extract the client ID and secret. We’ll use these in a moment.</p>
<p><strong>Configuring IAP for GKE backends</strong></p>
<p>Any services running in your cluster should now be available as backends that can be secured by IAP (along with any services in your project from other Compute options, such as Compute Engine, App Engine or Cloud Run). You can see a list of these backends to confirm your GKE services are included with this command:</p>
<pre><code class="lang-bash">gcloud compute backend-services list
</code></pre>
<p>Don’t be concerned with the <code>compute</code> sub-command! Services exposed by the GKE Gateway controller will show up in that list. You can also see this list on the IAP page in the Cloud Console, although at this stage it will show that OAuth is not properly configured. We’ll fix that next!</p>
<p>First, we need to store the OAuth 2.0 secret in a Kubernetes Secret object. Take the secret that you extracted from the credentials file a moment ago and write it into a text file called <code>secret.txt</code>. Then we’ll store this in a Secret called <code>oauth-secret</code>:</p>
<pre><code class="lang-bash">kubectl -n store create secret generic oauth-secret --from-file=key=secret.txt
</code></pre>
<p>Now we can create the <code>GCPBackendPolicy</code> to attach IAP to our service. In this example, you’ll need to replace <code>&lt;CLIENT_ID&gt;</code> with the OAuth 2.0 client ID from your credentials file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.gke.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GCPBackendPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">iap-backend-policy</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">default:</span>
    <span class="hljs-attr">iap:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">oauth2ClientSecret:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">oauth-secret</span>
      <span class="hljs-attr">clientID:</span> <span class="hljs-string">&lt;CLIENT_ID&gt;</span>
  <span class="hljs-attr">targetRef:</span>
    <span class="hljs-attr">group:</span> <span class="hljs-string">""</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">store</span>
</code></pre>
<p>Once again, we can change the <code>targetRef</code> like before if we’re using a <code>ServiceImport</code> rather than a <code>Service</code>. It will take a few minutes for the configuration to synchronize, but you can check on its status with this command:</p>
<pre><code class="lang-bash">kubectl -n store describe gcpbackendpolicy
</code></pre>
<p>This should show you a message like: <strong>Application of GCPGatewayPolicy "default/backend-policy" was a success</strong>. You should now also see that everything is okay in the IAP page of the Cloud Console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738341828275/3d0f6da4-18ef-4b25-a456-ca6f7f9d7027.png" alt class="image--center mx-auto" /></p>
<p>External users who now try to access your application through the load balancer will be redirected to the Google sign-in process using the consent screen that you configured. That user will need to successfully authenticate with their Google credentials, and they’ll need the <strong>IAP-Secured Web App User</strong> role in your project’s IAM bindings to be granted access.</p>
<p>Other policies are available to help you configure your Gateway resources, but these two are the most useful regarding application security. For full details you can check out the documentation here: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources">https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources</a></p>
<h2 id="heading-securing-traffic-with-kubernetes-network-policies">Securing Traffic with Kubernetes Network Policies</h2>
<p>In the previous post we described using a collection of <code>AuthorizationPolicy</code> objects with our service mesh to control the flow of traffic between our various workloads. As we’ve already learned, the service mesh approach is very powerful and quite comprehensive. However, it can be difficult to apply control logic between your workloads and other points on the network that exist outside of your mesh. And after reading the last few posts in this series, you may have actually decided that you don’t want a mesh after all!</p>
<p>Thankfully, Kubernetes still gives us a native form of traffic control: <code>NetworkPolicies</code>. While these objects do not require a service mesh, your GKE cluster’s network plugin must be configured to enforce them. You can use the <code>--enable-network-policy</code> argument when creating a new cluster, or enable network policy enforcement on an existing cluster with this command:</p>
<pre><code class="lang-bash">gcloud container clusters update my-cluster \
  --update-addons=NetworkPolicy=ENABLED
</code></pre>
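<p>Depending on how your cluster was originally created, you may also need a second update to enable enforcement on the node pools themselves (this typically recreates the node pools, so plan for a rolling update); check the current GKE documentation for your cluster type before running it:</p>
<pre><code class="lang-bash">gcloud container clusters update my-cluster \
  --enable-network-policy
</code></pre>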
<p><code>NetworkPolicies</code> can look a little confusing at first, but once you understand their logic, they are quite easy to read. The primary components of the object are:</p>
<ul>
<li><p>A <code>podSelector</code>: This element chooses which Pods should be affected by the policy</p>
</li>
<li><p>Policy types: You can include <strong>Ingress</strong> and <strong>Egress</strong> rules in a policy, for traffic entering and leaving a <code>Pod</code> respectively.</p>
</li>
<li><p>The rules themselves: Inside your Ingress and Egress elements, you can specify rules that determine whether traffic is allowed. These rules are based on additional <code>Pod</code> selectors, namespace selectors or IP blocks to match the source of traffic for Ingress rules, or the destination of traffic for Egress rules.</p>
</li>
</ul>
<p>Let’s walk through the example <code>NetworkPolicy</code> from the Kubernetes documentation:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">test-network-policy</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">default</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">role:</span> <span class="hljs-string">db</span>
  <span class="hljs-attr">policyTypes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Egress</span>
  <span class="hljs-attr">ingress:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">ipBlock:</span>
        <span class="hljs-attr">cidr:</span> <span class="hljs-number">172.17</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span><span class="hljs-string">/16</span>
        <span class="hljs-attr">except:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-number">172.17</span><span class="hljs-number">.1</span><span class="hljs-number">.0</span><span class="hljs-string">/24</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
        <span class="hljs-attr">matchLabels:</span>
          <span class="hljs-attr">project:</span> <span class="hljs-string">myproject</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">podSelector:</span>
        <span class="hljs-attr">matchLabels:</span>
          <span class="hljs-attr">role:</span> <span class="hljs-string">frontend</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">6379</span>
  <span class="hljs-attr">egress:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">ipBlock:</span>
        <span class="hljs-attr">cidr:</span> <span class="hljs-number">10.0</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span><span class="hljs-string">/24</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">5978</span>
</code></pre>
<p>At the start of the <code>spec</code> is a <code>podSelector</code>, which determines which <code>Pods</code> will be affected by this policy by “selecting” them. You may see <code>podSelectors</code> in other parts of the policy, and this is why indentation levels in YAML are so important! In this example, all <code>Pods</code> that contain the label <code>role</code> with the value <code>db</code> in their metadata will be selected and affected.</p>
<p>If you don’t want to select specific <code>Pod</code> labels, you can instead affect an entire namespace. You can do this by specifying a namespace (either in the YAML or at the time of applying a manifest), and using an empty <code>podSelector</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">all-pods-policy</span>
  <span class="hljs-attr">metadata:</span> <span class="hljs-string">secure-namespace</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
</code></pre>
<p>Going back to the example from the Kubernetes documentation, once we know which <code>Pods</code> are being affected, we then specify which <code>policyTypes</code> we’re going to include in our policy. Once again, we specify <strong>Ingress</strong> rules to control how traffic should be allowed into selected Pods, and <strong>Egress</strong> rules to control how traffic should be allowed out. Each set of rules then gets its own section in our YAML manifest. Incidentally, if you just want to create <strong>Ingress</strong> rules on their own, you can leave out the <code>policyTypes</code> section entirely. By default, Kubernetes will then expect you only to define the <strong>Ingress</strong> section.</p>
<p>Next is our <code>ingress</code> section, where we specify the sources from which we will accept network traffic. Put another way, the <code>Pods</code> we selected a moment ago will only receive traffic from sources that match these rules. In our example, we have three different types of source:</p>
<ul>
<li><p>The <code>ipBlock</code> specifies a network range of <code>172.17.0.0/16</code>, but excludes the specific sub-range of <code>172.17.1.0/24</code>.</p>
</li>
<li><p>The <code>namespaceSelector</code> uses labels to match the originating namespace of any <code>Pod</code>. In this case, <code>Pods</code> must come from a namespace that contains the metadata label <code>project</code> with a value of <code>myproject</code> for this rule to evaluate to true.</p>
</li>
<li><p>The <code>podSelector</code>, used as a source, matches <code>Pods</code> that contain the label <code>role</code> with the value of <code>frontend</code>.</p>
</li>
</ul>
<p>It’s important to note that <em>any of these sources</em> can match for this rule to evaluate to true and for the ingress traffic to be allowed. However, we also have a <code>ports</code> section in our ingress rule. This specifies that only TCP traffic on port 6379 will be accepted. As you can see, this is a fairly restrictive policy.</p>
<p>Finally, we have an <code>egress</code> section. We could use the same variety of sources and selectors as we used for ingress, but in this example, we simply specify an <code>ipBlock</code> and <code>ports</code> parameter. This means that Pods selected by this policy will only be allowed to connect to IP addresses within the specified CIDR range on TCP port 5978.</p>
<h3 id="heading-designing-network-policies">Designing network policies</h3>
<p>So how should we design our network policies? They can quickly become complex and cumbersome, so it's important to understand some basic logic of how they work. As we’ve already seen, network policies can apply two types of restrictions: <strong>Ingress</strong> for incoming connections and <strong>Egress</strong> for outgoing connections.</p>
<p>In both cases, if a <code>Pod</code> is not selected by any policy, a connection is allowed (in other words, it is not isolated or restricted). Remember that a <code>Pod</code> can be selected in several different ways as we saw in the previous example, including just by being part of a namespace that has been selected. But if no <strong>Ingress</strong> policies select a <code>Pod</code>, all inbound connections for that Pod will be allowed. Likewise, if no <strong>Egress</strong> policies select a <code>Pod</code>, all outbound connections from that Pod will be allowed.</p>
<p>If a Pod <em>is</em> selected by a policy, then <em>only the traffic allowed by the rules</em> of that policy will be allowed. <strong>Ingress</strong> and <strong>Egress</strong> are evaluated separately, however. Also, return traffic for any allowed connection is also implicitly allowed.</p>
<p>When deciding how to design our policies, remember that they are <em>additive</em>. Once a policy applies to a <code>Pod</code>, we’re starting with nothing being allowed and adding exceptions with our rules. This means that rules can’t conflict with each other, and the ordering of rules doesn’t matter.</p>
<p>We can create multiple policies that may apply to a single <code>Pod</code> or group of <code>Pods</code>. For this reason, it's recommended to create a policy per single <em>intention</em>. A connection from a frontend to a backend would qualify as a single intention. A backend may also accept connections from elsewhere, but these could be separated into additional policies.</p>
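<p>As a sketch of a single-intention policy (the namespace, labels and port here are hypothetical, and both workloads are assumed to run in the same namespace), the following allows only frontend <code>Pods</code> to reach backend <code>Pods</code> on their serving port:</p>
<pre><code class="lang-yaml">apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
</code></pre>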
<h3 id="heading-some-sensible-defaults">Some sensible defaults</h3>
<p>In most environments you'll likely select specific groups of <code>Pods</code> for specific connection types, but you may still wish to apply defaults to all other <code>Pods</code> that don’t have their own unique rules. We can use the logic of the <code>NetworkPolicy</code> object to apply some sensible defaults quite easily.</p>
<p>For example, when you’ve finished creating specific ingress rules for some workloads, you might want to make sure that all other <code>Pods</code> are denied Ingress traffic, even if they aren’t specifically selected. This policy will do that:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny-ingress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
</code></pre>
<p>This works because we use an empty <code>podSelector</code>, which selects every <code>Pod</code> in whichever namespace the policy is applied to (we haven’t hard-coded a namespace, so the same manifest can be applied to each namespace you want to protect). Then we apply an empty ingress policy – essentially specifying that we want to control ingress, but without supplying any rules that allow it. This policy will have no effect on Egress traffic.</p>
<p>Conversely, maybe you want to explicitly allow all ingress connections. The policy is similar, but now we supply an empty <code>ingress</code> rule, thereby matching all incoming traffic:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-all-ingress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">ingress:</span>
  <span class="hljs-bullet">-</span> {}
  <span class="hljs-attr">policyTypes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
</code></pre>
<p>Remember what we said earlier about policy addition and conflict? What do you think would happen if we added an additional ingress policy to select some <code>Pods</code> and try to control their inbound traffic? The answer is: it would have no effect. No incoming network traffic can now be denied with this blanket policy in place.</p>
<p>Once again, controlling ingress like this has no effect on egress traffic. However, you can create the egress equivalent of these rules simply by changing the <code>policyType</code>, and swapping the empty <code>ingress</code> parameter for an empty <code>egress</code> parameter on the allow-all rule.</p>
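<p>For completeness, here’s what that allow-all egress equivalent would look like:</p>
<pre><code class="lang-yaml">apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
spec:
  podSelector: {}
  egress:
  - {}
  policyTypes:
  - Egress
</code></pre>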
<p>Finally, if you want a blanket rule to deny all traffic <em>except</em> for what is specified in other policies, you can deny everything in one go with this policy:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny-all</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span> {}
  <span class="hljs-attr">policyTypes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Egress</span>
</code></pre>
<p><code>NetworkPolicies</code> may seem like an additional chore, but you should simply consider them as an extension of any other configuration you create for your workload, alongside such standard objects as <code>Deployment</code> and <code>Service</code>. Doing the hard work up front means that in the unfortunate event that a workload is compromised or starts misbehaving, its potential to cause further damage inside your system is considerably reduced.</p>
<h2 id="heading-summary">Summary</h2>
<p>In this post I’ve tried to cover some of the most essential security topics you need to know in addition to the security aspects of service mesh we already learned about. We’ve revisited Workload Identity, introduced Binary Authorization and discussed integrations with Cloud Armor and Cloud IAP. Finally, we’ve looked at <code>NetworkPolicies</code> and tried to normalize them as part of the standard configuration for any workload. In doing so, we’ve tried to emphasize that steps to improve security are a fundamental part of the way you design and operate your systems and applications. They’re not a “nice to have”; you ignore them at your peril. Do the work now and your future self will thank you!</p>
<p>But of course, we’ve barely touched the surface. Hopefully I’ve convinced you <em>why</em> this topic is important, but the whole world of Kubernetes security is too big to squeeze into a blog series that is ostensibly about operating GKE. I would strongly recommend researching the world of Pod Security Admission and Admission Controllers in general, Role Based Access Control (RBAC), and the various hardening guides available from the Kubernetes project and others.</p>
<p>The next post will be the last in this series (for now)! We may have learned a lot about GKE Enterprise and how to deploy its many features, but so far, we’ve been doing the work manually for the most part. Manual work doesn’t scale and is liable to human error, so to finish the series we’ll learn about automation and how modern DevOps tools and GitOps patterns can help us deploy securely again and again.</p>
<p>See you next time!</p>
<p><em>Cover image by</em> <a target="_blank" href="https://pixabay.com/users/tungart7-38741244/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=8656655"><em>Tung Lam</em></a> <em>from</em> <a target="_blank" href="https://pixabay.com//?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=8656655"><em>Pixabay</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Securing GKE Workloads with Service Mesh]]></title><description><![CDATA[This is the eighth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to sup...]]></description><link>https://timberry.dev/securing-gke-workloads-with-service-mesh</link><guid isPermaLink="true">https://timberry.dev/securing-gke-workloads-with-service-mesh</guid><category><![CDATA[google kubernetes engine]]></category><category><![CDATA[service mesh]]></category><category><![CDATA[#istio]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Fri, 17 Jan 2025 13:47:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737116535661/2d4859e6-0d71-4b93-a68d-2700687539ca.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the eighth post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>In the last 2 posts we looked at how GKE Service Mesh can help us with things like multi-cluster networking and fleet ingresses, and in the final part of our tour of this particular feature we’ll look at the additional security features a mesh provides; this can be one of the most compelling reasons to implement a mesh in the first place. Historically, many security design decisions ended at the boundary of a single server or a single monolithic application. This was also true for networking components, where entities within a specific network boundary (such as a firewall or VPN) were trusted and not subject to further rigorous security measures.</p>
<p>Cloud native designs change these approaches completely. Organizations including Google, the Cloud Native Computing Foundation and even the US Department of Defense all recommend zero-trust approaches to security, where service identity must always be proven, and permissions granted on a least-privilege basis. This is important when migrating monolithic applications to microservices, as doing so increases the number of moving parts and thus the size of the attack surface. Individual services may be compromised and impersonate other services on a network, either disrupting applications or exposing sensitive information.</p>
<p>So, we need a way to make sure that all service interactions are authenticated, authorized and controlled. Luckily, this is something the Service Mesh excels at. In this post we’re going to cover the following main topics:</p>
<ul>
<li><p>Implementing mutual transport layer security to encrypt all traffic</p>
</li>
<li><p>Adding service account identities to our workloads</p>
</li>
<li><p>Controlling which services and namespaces can access each other</p>
</li>
</ul>
<p>As we’ve covered the bulk of how Service Mesh works in <a target="_blank" href="https://timberry.dev/gke-istio-and-managed-service-mesh">previous</a> <a target="_blank" href="https://timberry.dev/multi-cluster-networking-with-service-mesh-in-gke">posts</a>, this will be a lighter tour through these topics as they are engineered well into Service Mesh and easy to implement. At the end of the post, we’ll summarize what we’ve learned about Service Mesh, and use these topics to springboard into further security concepts.</p>
<h2 id="heading-service-identity-and-end-to-end-encryption">Service identity and end-to-end encryption</h2>
<p>Let’s start by discussing why securing individual workloads such as microservices is so important. Many people assume that securing the boundary of their cloud infrastructure is sufficient. Perhaps you’re terminating TLS connections at your load balancer, or even using a Web Application Firewall to provide additional protections. Once the traffic is inside your network, you may be assuming it is safe. But as our application designs become more abstract and granular, we introduce more complexity, which in turn leads to a wider attack surface. If a bad actor can infiltrate some vulnerable code in one part of your application stack, what’s to stop them impersonating any other part to steal or damage your data?</p>
<p>A Service Mesh allows us to secure an environment composed of multiple microservices by adding a secure identity to each of our workloads. In this scenario, a single workload may be a <code>Pod</code>, or a collection of replica <code>Pods</code> managed by some kind of controller such as a <code>Deployment</code> or <code>StatefulSet</code>. The workload may represent some part of an overall system, such as the checkout, cart or frontend components we saw in the Online Boutique demo earlier in this series of posts. In a typical cluster, service discovery means that any of these workloads can find and connect to any other workload. Typically, this internal traffic is not encrypted either. That means that if just one component becomes compromised, an attacker may be able to intercept data intended for other components.</p>
<p>To illustrate this, let’s imagine that our frontend workload has been compromised. As it uses the same CIDR as other workloads, it could potentially intercept their network traffic. Attack code could even be run on this frontend to simulate a checkout service and infiltrate the payment service backend.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737114134660/11cccfd6-510c-42be-86d3-d206c2012a83.png" alt class="image--center mx-auto" /></p>
<p>The first part of remediating this is to encrypt the data sent between individual microservices. Historically this would have introduced a significant management burden, creating, assigning and distributing TLS certificates to each microservice, and making sure that each service trusts the certificate chain used to issue them. But now, Service Mesh can automate this entirely by simply enabling the policy and leveraging Google’s Mesh CA service. We’ll see how to do this in a moment.</p>
<p>The second part of our approach is to be <em>descriptive</em> with which services should be allowed to communicate with which other services, and in what ways. Now that your microservices are encrypted with their own client certificate, they will all have their own unique, provable identity. That means we can create rules to lock down communication between microservices to only the interactions that we know should be happening, therefore reducing the attack surface for bad actors.</p>
<p>Best of all, the implementation of these new security endeavors is all handled by the Envoy sidecar container, so in most cases no changes to actual application containers are required. Let’s look at how we set these features up in our clusters.</p>
<h3 id="heading-enabling-secure-two-way-communication">Enabling secure two-way communication</h3>
<p>Mutual TLS (or mTLS) is implemented by Istio in Cloud Service Mesh using the Envoy proxy sidecar containers running alongside your workloads. mTLS has two operating modes:</p>
<ul>
<li><p><strong>Permissive</strong>: Sidecar proxies will use TLS to connect with other sidecar proxies but will also allow non-TLS communication for incoming connections or connections to other workloads without sidecars.</p>
</li>
<li><p><strong>Strict</strong>: Only TLS connections will be allowed.</p>
</li>
</ul>
<p>By default, Cloud Service Mesh will enable <strong>permissive</strong> mTLS across your cluster out-of-the-box. However, it’s recommended to switch to <strong>strict</strong> mode to secure your mesh. This configuration is done with a <code>PeerAuthentication</code> object in the <code>istio-system</code> namespace. Istio considers this the root of your mesh and will apply the configuration to every namespace where Istio injection is enabled.</p>
<p>To secure your entire mesh, you could apply the object with the following YAML:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">"security.istio.io/v1beta1"</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">"PeerAuthentication"</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"mesh-wide-mtls"</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">"istio-system"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">mtls:</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">STRICT</span>
</code></pre>
<p>Alternatively, you could use the above YAML to apply a strict mTLS policy to a specific namespace by simply changing the namespace referenced, or removing it entirely from the YAML and adding it dynamically with <code>kubectl -n</code>, which can be useful for scripting.</p>
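<p>For example (the namespace name is just a placeholder), you could strip the namespace field from the manifest above, save it as (say) <code>strict-mtls.yaml</code>, and apply it per namespace like this:</p>
<pre><code class="lang-bash">kubectl -n payments apply -f strict-mtls.yaml
</code></pre>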
<p>In some scenarios, we may want to enable strict mTLS only for specific <em>workloads</em> rather than an entire namespace. We can do this by adding selectors to the <code>PeerAuthentication</code> object so that it only targets specific <code>Pods</code>.</p>
<p>For example, if our <code>Pods</code> have the metadata label <code>app=frontend</code>, we could use the following YAML definition:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">"security.istio.io/v1beta1"</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">"PeerAuthentication"</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"frontend-mtls"</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">"frontend"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">mtls:</span>
    <span class="hljs-attr">mode:</span> <span class="hljs-string">STRICT</span>
</code></pre>
<p>However, Cloud Service Mesh can’t aggregate workload-level policies for outbound mTLS traffic to a service, so we also need to add a matching <code>DestinationRule</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">"networking.istio.io/v1alpha3"</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">"DestinationRule"</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"frontend-dr-mtls"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">host:</span> <span class="hljs-string">"frontend.demo.svc.cluster.local"</span>
  <span class="hljs-attr">trafficPolicy:</span>
    <span class="hljs-attr">tls:</span>
      <span class="hljs-attr">mode:</span> <span class="hljs-string">ISTIO_MUTUAL</span>
</code></pre>
<p>With these objects in place, incoming requests to the frontend service will now require TLS. We’ll need to be mindful of what incoming connections we expect as unencrypted requests will now be rejected. Usually, we rely on an Ingress gateway to accept a request from an external user (either directly or via a Cloud Load Balancer). The Ingress Gateway Pods would then require mTLS to communicate securely with the frontend.</p>
<h2 id="heading-using-policies-to-authorize-connections">Using policies to authorize connections</h2>
<p>As we’ve already discussed, encrypting network traffic between workloads is just one part of the solution. However, in forcing the use of certificates for mTLS, we can now also trust that the originator of a request is who they say they are. For example, a proxy accepting requests from the <strong>frontend</strong> workload over TLS can trust that the request <em>really is</em> from <strong>frontend</strong> because of the certificate being used.</p>
<p>Once we’ve authenticated a request with its certificate, Cloud Service Mesh also allows us to create rules regarding authorization. In other words, we trust the identity of this workload, but do we want to allow them to make the connection?</p>
<p>In a microservices architecture, authorization policies can be useful to control which workloads are allowed to communicate with which other workloads. Remember, in the ephemeral world of Kubernetes, our <code>Pods</code> will have constantly changing IP addresses, so traditional firewall rules simply no longer make sense. Instead, we can use an <code>AuthorizationPolicy</code> to define a set of rules for traffic that we will permit, based on either a workload or namespace identity.</p>
<h3 id="heading-the-structure-of-an-authorizationpolicy">The structure of an AuthorizationPolicy</h3>
<p>The <code>AuthorizationPolicy</code> object is a custom resource which, when applied to our cluster, will create rules on all of the affected sidecar proxies in scope. An individual policy object can target the entire mesh, a namespace or an individual workload using selectors in a similar fashion to the <code>PeerAuthentication</code> object we saw earlier.</p>
<p>Policies are then comprised of an <strong>action</strong> and optionally (although usually!) some <strong>rules</strong>:</p>
<ul>
<li><p>The <strong>action</strong> is usually set to <code>ALLOW</code> or <code>DENY</code> the matching request. As we’ll see in a moment, there’s a slightly counter-intuitive way to build sets of rules using multiple <code>ALLOW</code>s without any <code>DENY</code>s for most use cases. In advanced use cases, the action can be <code>CUSTOM</code> which allows you to delegate the access control to an external authorization system. Finally, you can also specify an action of <code>AUDIT</code>, which causes the request to simply be logged and has no bearing on whether the connection is allowed or not. Normally, audit rules are applied in addition to allow and deny rules to assist with troubleshooting.</p>
</li>
<li><p>The <strong>rules</strong> define which requests will match and should be affected by the action. Within the object’s rules, we can specify the source of the request in the <code>from</code> section. The <code>to</code> section allows us to specify which operations should be permitted (such as HTTP GET, POST etc.). It may feel like the <code>to</code> section should specify the target of our policy, but don’t forget that we specify this in a <em>selector</em>, if we’re trying to target a specific workload. Finally, we can apply some additional conditions in a <code>when</code> field, including request headers that must match for us to apply the policy.</p>
</li>
</ul>
<p>Let’s see some example <code>AuthorizationPolicy</code> objects to illustrate how these conventions work. Here’s a basic policy to allow all requests to <code>Pods</code> that match the selector <code>app=frontend</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">"security.istio.io/v1beta1"</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">"AuthorizationPolicy"</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"frontend-ap"</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">action:</span> <span class="hljs-string">ALLOW</span>
</code></pre>
<p>This is obviously a very permissive policy, so we could consider locking things down a bit by adding some approved operations. In the example below, which targets <code>Pods</code> labelled <code>app=products</code> in the <code>default</code> namespace, we’ll specify that we only accept <code>HTTP GET</code> requests for paths that start with <code>/public</code> or <code>/test</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">security.istio.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">AuthorizationPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">tester</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">default</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">products</span>
  <span class="hljs-attr">action:</span> <span class="hljs-string">ALLOW</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">operation:</span>
        <span class="hljs-attr">methods:</span> [<span class="hljs-string">"GET"</span>]
        <span class="hljs-attr">paths:</span> [<span class="hljs-string">"/public/*"</span>, <span class="hljs-string">"/test/*"</span>]
</code></pre>
<p>Matching keywords like <code>paths</code> can also be used with their negative condition versions, which in this example would be <code>notPaths</code>. Including a negative condition in a policy would mean that the policy applies as normal but will <em>not</em> be applied to paths specified in the <code>notPaths</code> parameter.</p>
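<p>As a minimal sketch (the policy name and the <code>/test/admin</code> path are purely illustrative), a negative condition could be combined with the rule above like this:</p>
<pre><code class="lang-yaml">apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: tester-not-admin
  namespace: default
spec:
  selector:
    matchLabels:
      app: products
  action: ALLOW
  rules:
  - to:
    - operation:
        methods: ["GET"]
        paths: ["/public/*", "/test/*"]
        notPaths: ["/test/admin/*"]
</code></pre>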
<p>For a full list of supported operations, see <a target="_blank" href="https://istio.io/latest/docs/reference/config/security/authorization-policy/#Operation">https://istio.io/latest/docs/reference/config/security/authorization-policy/#Operation</a></p>
<h3 id="heading-using-service-accounts-with-mesh-identities">Using service accounts with mesh identities</h3>
<p>A common pattern is to specify policies that grant access based on the identity of the caller. As we’ve already mentioned, Cloud Service Mesh provides each service with a secure identity, so we know who is calling. We can then create policies that only allow requests from the callers we choose. However, we don’t reference the details of a TLS certificate directly in our <code>AuthorizationPolicy</code> objects; instead, we reference a Service Account.</p>
<p>Service Accounts in Kubernetes give a distinct identity to a workload and work with Role Based Access Control to provide a way to control which objects a workload may access. You’re probably already using them in your own Kubernetes projects; if you’re not, I definitely recommend looking them up in the Kubernetes documentation! So how do they relate to <code>AuthorizationPolicies</code>?</p>
<p>Let’s imagine a typical scenario where we have 2 workloads: <strong>frontend</strong> and <strong>backend</strong>, and we want to only allow requests to the <strong>backend</strong> service <em>if</em> they come from the <strong>frontend</strong> workload. In other words, we shouldn’t allow direct connections to <strong>backend</strong> from anywhere else.</p>
<p>Here’s some example YAML we could use to create a unique service account identity for the frontend, and make sure it’s being used in the frontend deployment:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceAccount</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-sa</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">frontend-image</span>
      <span class="hljs-attr">serviceAccountName:</span> <span class="hljs-string">frontend-sa</span>
</code></pre>
<p>We can now create an <code>AuthorizationPolicy</code> for our backend service, which will only allow access from the frontend service account as a <strong>principal</strong>. In Istio, principals are expressed relative to the mesh’s trust domain, which in Cloud Service Mesh on Google Cloud is derived from your project ID (in the form <code>PROJECT_ID.svc.id.goog</code>). With a project ID of <code>my-project-id</code>, your policy YAML could look like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">"security.istio.io/v1beta1"</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">"AuthorizationPolicy"</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"backend-access"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">backend</span>
  <span class="hljs-attr">action:</span> <span class="hljs-string">ALLOW</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">source:</span>
        <span class="hljs-attr">principals:</span> [<span class="hljs-string">"my-project-id.svc.id.goog/ns/default/sa/frontend-sa"</span>]
</code></pre>
<p>When a request comes from the identity of the <code>frontend-sa</code> service account, it will be allowed by this policy, as illustrated below. Requests can also be allowed from entire namespaces rather than individual principals using the <code>namespaces</code> source, and just like other conditions, we can construct rules with negative conditions such as <code>notPrincipals</code> and <code>notNamespaces</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737114814462/9d2f6e2a-9e68-4089-a3f8-ef456df75c7d.png" alt class="image--center mx-auto" /></p>
<p>For a full list of supported options for the source of a request, see <a target="_blank" href="https://istio.io/latest/docs/reference/config/security/authorization-policy/#Source">https://istio.io/latest/docs/reference/config/security/authorization-policy/#Source</a></p>
<h3 id="heading-assembling-and-layering-policies">Assembling and layering policies</h3>
<p>As we mentioned earlier, you can stack <code>ALLOW</code> and <code>DENY</code> policies however you want to, but the way they are processed can be a little counterintuitive. The most important thing to remember is that <code>DENY</code> policies are evaluated first. If a single <code>DENY</code> policy matches a request, that request will be denied before any <code>ALLOW</code> policies have been evaluated. In some systems, it’s customary to use something akin to a <code>DENY ALL</code> rule and then build up specific <code>ALLOW</code> rules based on desired behaviors. With <code>AuthorizationPolicies</code>, that won’t work as the <code>ALLOW</code> rules will never be seen.</p>
<p>The recommended approach for this pattern is instead to use an <code>ALLOW</code> policy that matches nothing. In the absence of other <code>ALLOW</code> policies, this will cause all requests to be denied (in the same spirit as a <code>DENY ALL</code> rule). However, we can now add further <code>ALLOW</code> policies to grant the access we want, and they will still be evaluated. The “Allow nothing” policy looks like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">security.istio.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">AuthorizationPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-nothing</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">action:</span> <span class="hljs-string">ALLOW</span>
</code></pre>
<p>Conversely, we might choose to set up an “Allow all” policy. This is similar, but contains a single empty rule that matches every request:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">security.istio.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">AuthorizationPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">allow-all</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">action:</span> <span class="hljs-string">ALLOW</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> {}
</code></pre>
<p>With an “Allow all” rule in place, we would have to build multiple <code>DENY</code> policies as well to stop undesirable connections. These would be evaluated first and could still block traffic, before any unmatched request is handled by the “Allow all” rule. As you can imagine, this approach leads to a lot of management overhead.</p>
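<p>Purely as an illustration (the policy name and the <code>/admin</code> path are assumptions, not something from the earlier examples), one of those <code>DENY</code> policies might look like this:</p>
<pre><code class="lang-yaml">apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-backend-admin
  namespace: default
spec:
  selector:
    matchLabels:
      app: backend
  action: DENY
  rules:
  - to:
    - operation:
        paths: ["/admin/*"]
</code></pre>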
<p>For this reason, the “Allow nothing” approach is definitely recommended. I just wanted to show you both approaches to help you understand the concepts involved.</p>
<h3 id="heading-testing-policies-before-enforcing-them">Testing policies before enforcing them</h3>
<p>Cloud Service Mesh also supports an Istio annotation that applies your policies in a “dry-run” mode. In this mode the policy is not enforced: matching requests are always allowed, but the verdict that <em>would</em> have been enforced is written to Cloud Logging. You can enable “dry-run” mode on any <code>AuthorizationPolicy</code> by adding the following annotation to the object’s metadata:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">"istio.io/dry-run":</span> <span class="hljs-string">"true"</span>
</code></pre>
<p>This feature can be useful if you need to test policies on live production traffic without actually enforcing them straight away.</p>
<p>Using policies to define what network connections count as legal and valid is a good way to document your system and enforce an important security principle. If you reduce the permitted connections between proxies to only the traffic that you know is required for your application stack, you are reducing the potential attack surface that could be exploited by a bad actor who has gained access to any component of your system. Policies can also protect components from services that are incorrectly configured or have been recently changed without proper testing.</p>
<h2 id="heading-summary">Summary</h2>
<p>We’ve now completed our journey through Cloud Service Mesh, a powerful and fully managed implementation of the popular Istio stack that is a key offering in GKE Enterprise. Hopefully the concepts of Service Mesh have now been demystified for you, and you’ll be able to determine if using Service Mesh is the right choice for your environments and workloads in the future.</p>
<p>Some argue against Service Mesh due to the additional complexity it can introduce. It’s true that using it effectively will require you to think about additional layers of configuration for your workloads. However, with Cloud Service Mesh it's never been easier to get started and leverage the benefits of Istio. Ultimately the level of increased complexity you want to add to your environment may depend on the original complexity of your application stack. Environments that may not warrant a mesh include single stateless services, or long-running applications with consistent levels of demand. Such environments are rare these days however, and if you’re already considering GKE Enterprise, chances are you have a complex microservices stack to deploy. Despite the extra work, a mesh will reward you by making traffic management, observability and security easier in the long run.</p>
<p>While we’ll be moving away from discussing Service Mesh, in my next post I’ll continue with the theme of security. Traffic control is just one part of our security toolbelt in Kubernetes, so stay tuned for more guidance on workload identity, binary authorization, network policies and how we integrate Google Cloud’s native security tools into our GKE workloads.</p>
<p><em>Cover image by</em> <a target="_blank" href="https://pixabay.com/users/tungart7-38741244/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=8760347"><em>Tung Lam</em></a> <em>from</em> <a target="_blank" href="https://pixabay.com//?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=8760347"><em>Pixabay</em></a></p>
]]></content:encoded></item><item><title><![CDATA[Multi-cluster Networking with Service Mesh in GKE]]></title><description><![CDATA[This is the seventh post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to su...]]></description><link>https://timberry.dev/multi-cluster-networking-with-service-mesh-in-gke</link><guid isPermaLink="true">https://timberry.dev/multi-cluster-networking-with-service-mesh-in-gke</guid><category><![CDATA[gke]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[service mesh]]></category><category><![CDATA[multi cluster]]></category><category><![CDATA[#istio]]></category><category><![CDATA[istio service mesh]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Fri, 29 Nov 2024 11:41:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732879663680/e1fdf5e0-5a2a-4761-bd6f-d8aea01699d0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the seventh post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>In this post we’ll continue our journey into the wonderful world of Cloud Service Mesh by exploring how we extend a mesh across multiple clusters, including clusters in different subnets, projects and even clouds! So far, we’ve discussed several scenarios where running multiple clusters can be beneficial, above and beyond simple scalability and high availability. Often there are requirements for elements of a workload stack to run in a specific geographic location, and even on specific hardware on-premises. GKE Enterprise allows us to manage clusters that run on our own hardware, and combining this approach with a service mesh allows us to also leverage things like service discovery and secure microservices in a true hybrid cloud environment.</p>
<p>Cloud Service Mesh provides multi-cluster capabilities and endpoint discovery across GKE clusters in Google Cloud automatically, but extending this approach to managed GKE clusters in other clouds and attached Kubernetes clusters on-premises presents some new challenges. Building on the foundational service mesh topics we covered in the last post, we’ll add these multi-cluster scenarios to our tool belt, before approaching service mesh security in my next post. After that, you’ll have all the knowledge you need to know if service mesh is right for your workloads.</p>
<p>So, in this post, we’ll cover the following topics:</p>
<ul>
<li><p>Understanding the different operating modes of Cloud Service Mesh</p>
</li>
<li><p>Setting up service mesh discovery across multiple clusters</p>
</li>
<li><p>Extending service mesh outside of Google Cloud</p>
</li>
</ul>
<p>Just like in previous posts, I’ll also walk through a working example so you can see how all the pieces fit together.</p>
<h2 id="heading-types-of-multi-cluster-mesh">Types of multi-cluster mesh</h2>
<p>Google’s Cloud Service Mesh supports several modes of installation and operation, with varying levels of complexity:</p>
<ul>
<li><p><strong>Managed Multi-Cluster Mesh on Google Cloud</strong>: This is the most straightforward option and works almost completely “out of the box”, leveraging Cloud Service Mesh as a fully managed service and providing endpoint discovery between all GKE clusters inside Google Cloud. Clusters can exist in the same project or in different projects, providing they run on the same VPC network and in the same fleet. Shared VPC can be used to manage multi-project networking.</p>
</li>
<li><p><strong>In-Cluster Mesh inside Google Cloud</strong>: This option uses the <code>asmcli</code> tool to install the Istio control plane directly into your clusters. It gives you complete control over all service mesh components, but this means you can no longer rely on Cloud Service Mesh as a managed service; you will need to maintain and update the Istio components yourself. In-Cluster Mesh is no longer recommended for clusters running inside Google Cloud, except in outlier use cases.</p>
</li>
<li><p><strong>In-Cluster Mesh outside Google Cloud</strong>: Using the <code>asmcli</code> tool you can also install Cloud Service Mesh in its non-managed form on GKE clusters running in VMWare, Azure, AWS and even bare metal. Compared with installing Istio from scratch, the in-cluster mesh still allows you to observe your mesh through the Google Cloud console, and provides some additional assistance with things like east-west gateways.</p>
</li>
</ul>
<p>The limitation of each of these approaches is that the multi-cluster mesh will only extend to additional clusters in the same environment. A true hybrid mesh would allow the same service mesh installation to extend across your Google Cloud projects, other clouds and even into your own datacenter. At the time of writing, hybrid mesh is still in preview and is not supported by Google. However, we will briefly cover this approach at the end of the post.</p>
<h2 id="heading-meshing-clusters-within-google-cloud">Meshing clusters within Google Cloud</h2>
<p>Meshing two or more GKE clusters within Google Cloud is supported by default with Cloud Service Mesh. The main prerequisites for service mesh are that all clusters belong to the same fleet, and that connectivity between all Pods in all clusters is allowed. This can be achieved in one of two ways:</p>
<ul>
<li><p>Clusters in the same project on the same network</p>
</li>
<li><p>Clusters in different projects sharing the same network via Shared VPC</p>
</li>
</ul>
<p>There are some additional considerations for clusters in different subnets in a VPC, and private clusters, but we’ll cover those later on. For now, let’s walk through a basic multi-cluster mesh example.</p>
<p>To try out a multi-cluster mesh, we just need to ensure that Cloud Service Mesh is enabled for our fleet, and then create 2 GKE clusters, ensuring that they are registered to the fleet at the time of creation (we covered how to set this up in the <a target="_blank" href="https://timberry.dev/gke-istio-and-managed-service-mesh">last post</a>).</p>
<p>In the following steps we’ll refer to these clusters as <code>cluster-1</code> and <code>cluster-2</code>. If you’re following along, make sure the output of <code>gcloud container fleet mesh describe</code> shows that everything in your mesh is ready to go.</p>
<h3 id="heading-testing-the-mesh-with-a-sample-app">Testing the mesh with a sample app</h3>
<p>To test the cross-cluster mesh, we’ll deploy a very basic <strong>Hello World</strong> application to both clusters. The app will expose a service that replies with “Hello World”, but with a different version number for each cluster so we can trace which cluster serves the request. Because we now have multi-cluster service discovery, the <code>Service</code> object we call will automatically include Pods from both clusters in its endpoints!</p>
<p>Because we’re going to be running commands on two clusters in this walkthrough, we’re going to employ some handy tricks:</p>
<ul>
<li><p>We’ll write YAML manifest files that contain multiple objects, so we’ll use <code>-l</code> to specify that only the objects that match specific labels should be created.</p>
</li>
<li><p>We’ll be using the <code>--context</code> option with <code>kubectl</code> to specify which cluster to apply to. You should already have two contexts set up in your <code>kubeconfig</code> for your clusters, but you can rename them to something more convenient like <code>cluster-1</code> and <code>cluster-2</code> using <code>kubectl config rename-context</code>, as shown in the example just after this list.</p>
</li>
</ul>
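<p>A quick sketch of what that renaming might look like (the auto-generated context names shown here are illustrative and will differ in your project):</p>
<pre><code class="lang-bash"># List the contexts created when you fetched your cluster credentials
kubectl config get-contexts -o name
# Rename the long auto-generated names to something shorter
kubectl config rename-context gke_my-project_europe-west2-a_cluster-1 cluster-1
kubectl config rename-context gke_my-project_europe-west2-a_cluster-2 cluster-2
</code></pre>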
<p>So, the first thing to do is create a namespace for our app and label it for Istio’s automatic sidecar injection. Using the tricks above, we’ll do this for both clusters:</p>
<pre><code class="lang-bash">kubectl --context=cluster-1 create ns sample
kubectl --context=cluster-1 label namespace sample istio-injection=enabled
kubectl --context=cluster-2 create ns sample
kubectl --context=cluster-2 label namespace sample istio-injection=enabled
</code></pre>
<p>Now let’s create the <code>helloworld.yaml</code> file that contains the <code>Deployment</code> and <code>Service</code> definitions we need:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">helloworld</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
    <span class="hljs-attr">service:</span> <span class="hljs-string">helloworld</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">5000</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">helloworld-v1</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
    <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
      <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
        <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">helloworld</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">docker.io/istio/examples-helloworld-v1:1.0</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"100m"</span>
        <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span> <span class="hljs-comment">#Always</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5000</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">helloworld-v2</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
    <span class="hljs-attr">version:</span> <span class="hljs-string">v2</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
      <span class="hljs-attr">version:</span> <span class="hljs-string">v2</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">helloworld</span>
        <span class="hljs-attr">version:</span> <span class="hljs-string">v2</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">helloworld</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">docker.io/istio/examples-helloworld-v2:1.0</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"100m"</span>
        <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span> <span class="hljs-comment">#Always</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5000</span>
</code></pre>
<p>Using this file and our labels trick, we’ll first create the <code>Service</code> object we need on both clusters:</p>
<pre><code class="lang-bash">kubectl create --context=cluster-1 -l service=helloworld -n sample -f helloworld.yaml
kubectl create --context=cluster-2 -l service=helloworld -n sample -f helloworld.yaml
</code></pre>
<p>And now the <code>Deployment</code> objects. Note that we create a different version on each cluster:</p>
<pre><code class="lang-bash">kubectl create --context=cluster-1 -l version=v1 -n sample -f helloworld.yaml
kubectl create --context=cluster-2 -l version=v2 -n sample -f helloworld.yaml
</code></pre>
<p>Try using the <code>--context</code> method to check the <code>Service</code> and <code>Deployment</code> objects on both clusters, as well as listing the running <code>Pods</code>. You should see that each <code>Pod</code> has two containers, as the sidecar proxies have been injected.</p>
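<p>For example, using the namespace and labels we deployed above:</p>
<pre><code class="lang-bash">kubectl --context=cluster-1 -n sample get svc,deploy
kubectl --context=cluster-2 -n sample get pods -l app=helloworld
# Each Pod should report 2/2 in the READY column: the app container plus the injected sidecar
</code></pre>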
<p>At this point we’ve got our sample app deployed, but how do we test it? In the next step we’ll create a simple <code>Deployment</code> that we can use to make some in-cluster requests to our newly deployed service. If we make repetitive requests, we should hopefully see the responses getting balanced across both clusters.</p>
<p>In <code>sleep.yaml</code> we essentially create a dummy application that just gives us an in-cluster environment for us to run <code>curl</code> commands and other network tests:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceAccount</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sleep</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sleep</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">sleep</span>
    <span class="hljs-attr">service:</span> <span class="hljs-string">sleep</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">sleep</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">sleep</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">sleep</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">sleep</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">terminationGracePeriodSeconds:</span> <span class="hljs-number">0</span>
      <span class="hljs-attr">serviceAccountName:</span> <span class="hljs-string">sleep</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">sleep</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">curlimages/curl</span>
        <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/sleep"</span>, <span class="hljs-string">"infinity"</span>]
        <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span>
        <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/etc/sleep/tls</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">secret-volume</span>
      <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">secret-volume</span>
        <span class="hljs-attr">secret:</span>
          <span class="hljs-attr">secretName:</span> <span class="hljs-string">sleep-secret</span>
          <span class="hljs-attr">optional:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Now deploy the app to both clusters:</p>
<pre><code class="lang-bash">kubectl apply --context=cluster-1 -n sample -f sleep.yaml
kubectl apply --context=cluster-2 -n sample -f sleep.yaml
</code></pre>
<p>We can now execute a command within the <code>sleep</code> container to test the Hello World application. You can run this test from either cluster, but for simplicity we won’t specify a context in the command examples below.</p>
<p>First, we get the name of a sleep Pod:</p>
<pre><code class="lang-bash">kubectl get pod -n sample -l app=sleep
</code></pre>
<p>Now we can use that <code>Pod</code> to execute a loop of <code>curl</code> commands to access the Hello World service. In the following example, replace <code>&lt;POD_NAME&gt;</code> with the Pod name you grabbed from the previous command:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -n sample -c sleep &lt;POD_NAME&gt; \
  -- /bin/sh -c <span class="hljs-string">'for i in $(seq 1 20); do curl -sS helloworld.sample:5000/hello; done'</span>
</code></pre>
<p>You should see 20 responses, roughly distributed across both clusters. Once again, <code>Pods</code> from either cluster (represented by versions 1 and 2) will match endpoints for the service. Multi-cluster mesh and service discovery in action!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732878058254/adedf667-e975-4e20-8151-73e7b54df20c.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-exposing-a-meshed-service-to-a-load-balancer">Exposing a meshed service to a load balancer</h3>
<p>We can take what we learned in our previous post and expose a multi-cluster service via a service mesh <code>VirtualService</code> and a <code>Gateway</code>. However, bear in mind that in this scenario, the <code>Pods</code> running the ingress itself will still only run in a single cluster (we’ll sum up multi-cluster load-balancing options later on).</p>
<p>First, we need to install the Istio ingress gateway <code>Deployment</code>. Recall from my previous post that these are the standalone envoy proxies running at the edge of our cluster. The easiest way to do this is to use the manifests provided by Google in this repo: <a target="_blank" href="https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages.git">https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages.git</a></p>
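<p>If you don’t already have that repository locally, cloning it looks like this:</p>
<pre><code class="lang-bash">git clone https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages.git
cd anthos-service-mesh-packages
</code></pre>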
<p>Let’s create a dedicated namespace for the ingress gateway, and label it for sidecar injection. We only need to do this on <code>cluster-1</code> for this example, so set your <code>kubectl</code> context appropriately.</p>
<pre><code class="lang-bash">kubectl create ns gateway-ns
kubectl label namespace gateway-ns istio-injection=enabled
</code></pre>
<p>From the git repo directory, navigate to <code>samples/gateways</code>, where you should find a directory called <code>istio-ingressgateway</code>. This contains manifests for the <code>Deployment</code>, along with the other supporting objects we need. Deploy it to your cluster with this command:</p>
<pre><code class="lang-bash">kubectl -n gateway-ns apply -f istio-ingressgateway/
</code></pre>
<p>With our ingress <code>Pods</code> running, we can now configure the <code>Gateway</code> and <code>VirtualService</code> objects. Remember that the <code>Gateway</code> object defines <em>how</em> the ingress gateway Pods should be configured, including information such as hosts, protocols and ports. The <code>VirtualService</code> object will then define the <em>routes</em> that our gateway should use, mapping traffic from our gateway to our backend services.</p>
<p>First we’ll create the <code>sample-gateway.yaml</code> object and apply it to <code>cluster-1</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.istio.io/v1alpha3</span> 
<span class="hljs-attr">kind:</span> <span class="hljs-string">Gateway</span> 
<span class="hljs-attr">metadata:</span> 
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-gateway</span> 
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">sample</span> 
<span class="hljs-attr">spec:</span> 
  <span class="hljs-attr">selector:</span> 
    <span class="hljs-attr">istio:</span> <span class="hljs-string">ingressgateway</span> 
  <span class="hljs-attr">servers:</span> 
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> 
      <span class="hljs-attr">number:</span> <span class="hljs-number">80</span> 
      <span class="hljs-attr">name:</span> <span class="hljs-string">http</span> 
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">HTTP</span> 
    <span class="hljs-attr">hosts:</span> 
    <span class="hljs-bullet">-</span> <span class="hljs-string">"*"</span>
</code></pre>
<p>Then we create the <code>sample-virtualservice.yaml</code> object and also apply it to <code>cluster-1</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.istio.io/v1alpha3</span> 
<span class="hljs-attr">kind:</span> <span class="hljs-string">VirtualService</span> 
<span class="hljs-attr">metadata:</span> 
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-ingress</span> 
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">sample</span> 
<span class="hljs-attr">spec:</span> 
  <span class="hljs-attr">hosts:</span> 
  <span class="hljs-bullet">-</span> <span class="hljs-string">"*"</span> 
  <span class="hljs-attr">gateways:</span> 
  <span class="hljs-bullet">-</span> <span class="hljs-string">frontend-gateway</span> 
  <span class="hljs-attr">http:</span> 
  <span class="hljs-bullet">-</span> <span class="hljs-attr">route:</span> 
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span> 
        <span class="hljs-attr">host:</span> <span class="hljs-string">helloworld</span> 
        <span class="hljs-attr">port:</span> 
          <span class="hljs-attr">number:</span> <span class="hljs-number">5000</span>
</code></pre>
<p>Our multi-cluster meshed service is now exposed to the world via a Cloud Load Balancer! To test requests, you can find the external IP by listing the services in the <code>gateway-ns</code> namespace:</p>
<pre><code class="lang-bash">kubectl -n gateway-ns get svc
</code></pre>
<p>Then just hit <code>http://&lt;EXTERNAL-IP&gt;/hello</code> in your browser, substituting <code>&lt;EXTERNAL-IP&gt;</code> for the IP address you noted from the previous command. You should see the Hello World app respond, and if you keep reloading the page you’ll get responses from version 1 (on cluster 1) and version 2 (on cluster 2).</p>
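<p>If you’d rather test from the command line, a rough equivalent is the following (the service name <code>istio-ingressgateway</code> is an assumption based on the sample manifests, so check the output of the previous command for the actual name):</p>
<pre><code class="lang-bash"># Grab the external IP of the ingress gateway service, then hit it repeatedly
EXTERNAL_IP=$(kubectl -n gateway-ns get svc istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
for i in $(seq 1 10); do curl -sS "http://${EXTERNAL_IP}/hello"; done
</code></pre>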
<h3 id="heading-considering-other-options-for-multi-cluster-load-balancing">Considering other options for multi-cluster load balancing</h3>
<p>So far, we’ve walked through a simple demonstration to help learn the concepts of service mesh in a multi-cluster environment, but we’ve still arrived at an environment where ingress is dependent on a single cluster. This is not going to be suitable for many production use-cases, so where do we go from here?</p>
<p>There are a few different options to choose from for a multi-cluster approach to load balancing that does not require a single point of failure. For extensive details on each you can refer to the documentation here: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/concepts/choose-mc-lb-api">https://cloud.google.com/kubernetes-engine/docs/concepts/choose-mc-lb-api</a></p>
<p>The recommended option is to use a <strong>Multi-Cluster Gateway</strong>. In the example we’ve just set up, our single-cluster gateway simply exposes an external service, connecting itself to a Cloud Load Balancer, but the gateway is an on-cluster resource.</p>
<p>However, as I covered in a <a target="_blank" href="https://timberry.dev/fleet-ingress-options-for-gke-enterprise#heading-multi-cluster-gateways">previous post</a>, we can leverage the <strong>GKE Gateway Controller</strong> to provide a fully managed off-cluster controller that interacts with our <code>Gateway</code> classes, <code>VirtualServices</code> and <code>HTTPRoutes</code>, while providing reliable and highly available cross-cluster load balancing. The GKE Controller works with both Standard and Autopilot clusters, and it simply provides a management layer over standard open-source Kubernetes objects, rather than requiring any proprietary implementation.</p>
<p>Another alternative is the <strong>GKE Multi Cluster Ingress</strong> controller. This controller also runs off-cluster as a managed service, but it requires less configuration by simply leveraging the <code>MultiClusterService</code> object. However, that means it doesn’t support all the advanced features of the Gateway API we’ve learned about.</p>
<p>The final option is to configure your own <strong>Network Endpoint Groups</strong> (NEGs) for Cloud Load Balancers. This is an advanced option as it will require the use of Terraform and the Config Connector to automate the assignment of Pod IPs to Load Balancer backends. This might be a useful option for outlier use cases such as combining backends from GKE clusters with serverless workloads, private endpoints or hybrid cloud endpoints in the same load balancer. As always, the best way forward is to consider exactly what your use case needs to achieve and make your design choices accordingly. The simplest choice may often be the best, but try to consider whether you may need additional features in the future!</p>
<h3 id="heading-using-clusters-in-different-subnets">Using clusters in different subnets</h3>
<p>In the example we walked through, all nodes in our clusters existed on the same VPC subnet. By default, GKE will create firewall rules that allow traffic between nodes on the same subnet, so in this scenario no additional firewall configuration is required. However, in many cases you will be using different subnets, and you must create these rules yourself.</p>
<p>It’s recommended to allow all ports between your cluster nodes on their internal IP addresses. We can achieve this by simply adding a firewall rule to the VPC. However, we will need lists of all subnet CIDRs and network tags used by our nodes. We can do some Bash trickery to obtain these lists and store them in variables, which we’ll use in a moment:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">function</span> join_by { <span class="hljs-built_in">local</span> IFS=<span class="hljs-string">"<span class="hljs-variable">$1</span>"</span>; <span class="hljs-built_in">shift</span>; <span class="hljs-built_in">echo</span> <span class="hljs-string">"$*"</span>; }
ALL_CLUSTER_CIDRS=$(gcloud container clusters list --project <span class="hljs-variable">$PROJECT_1</span> --format=<span class="hljs-string">'value(clusterIpv4Cidr)'</span> | sort | uniq)
ALL_CLUSTER_CIDRS=$(join_by , $(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">${ALL_CLUSTER_CIDRS}</span>"</span>))
ALL_CLUSTER_NETTAGS=$(gcloud compute instances list --project <span class="hljs-variable">$PROJECT_1</span> --format=<span class="hljs-string">'value(tags.items.[0])'</span> | sort | uniq)
ALL_CLUSTER_NETTAGS=$(join_by , $(<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">${ALL_CLUSTER_NETTAGS}</span>"</span>))
</code></pre>
<p>Now that we have the <code>ALL_CLUSTER_CIDRS</code> and <code>ALL_CLUSTER_NETTAGS</code> variables populated, we can run the following command to create a firewall rule called <code>istio-multicluster-pods</code>. You’ll just need to substitute <code>YOUR_VPC</code> with the name of your VPC network:</p>
<pre><code class="lang-bash">gcloud compute firewall-rules create istio-multicluster-pods \
    --allow=tcp,udp,icmp,esp,ah,sctp \
    --direction=INGRESS \
    --priority=900 \
    --source-ranges=<span class="hljs-string">"<span class="hljs-variable">${ALL_CLUSTER_CIDRS}</span>"</span> \
    --target-tags=<span class="hljs-string">"<span class="hljs-variable">${ALL_CLUSTER_NETTAGS}</span>"</span> --quiet \
    --network=YOUR_VPC
</code></pre>
<p>All your <code>Pods</code> will now have free-flowing communication between clusters, providing the underlying VPC routes support the connections.</p>
<h3 id="heading-considerations-for-private-clusters">Considerations for private clusters</h3>
<p>If you are using private GKE clusters, you will not have public endpoints that can be accessed by the service mesh control plane. This means you have quite a few more hoops to jump through to get a multi-cluster mesh working. The instructions are quite complex, so rather than list them all here I will simply discuss them so you understand what they accomplish, then point you at the Google Cloud documentation page that will build the necessary commands for you.</p>
<p>At a high level:</p>
<ul>
<li><p>You will first need to configure endpoint discovery. This involves manually creating remote secrets to represent the private IPs of the cluster (because public IPs are not available). You will then need to apply each cluster’s secrets to the other clusters.</p>
</li>
<li><p>Then you will need to configure authorized networks for the private clusters. This will involve getting the CIDR ranges used for Pods from each cluster and adding them as authorized networks to the other clusters.</p>
</li>
</ul>
<p>Just to make this process even trickier, these instructions are detailed across multiple pages of documentation! The instructions for endpoint discovery can be found here: <a target="_blank" href="https://cloud.google.com/service-mesh/docs/operate-and-maintain/multi-cluster#endpoint-discovery-declarative-api">https://cloud.google.com/service-mesh/docs/operate-and-maintain/multi-cluster#endpoint-discovery-declarative-api</a> and the instructions for opening ports on private clusters can be found here: <a target="_blank" href="https://cloud.google.com/service-mesh/docs/operate-and-maintain/private-cluster-open-port">https://cloud.google.com/service-mesh/docs/operate-and-maintain/private-cluster-open-port</a></p>
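<p>Very roughly, and only to give a flavour of what those pages walk you through (the cluster names, zone and CIDR range below are placeholders, and the exact commands for your setup may differ, so follow the linked documentation), the two pieces look something like this:</p>
<pre><code class="lang-bash"># Endpoint discovery: create a remote secret for cluster-2 and apply it to cluster-1
istioctl create-remote-secret --context=cluster-2 --name=cluster-2 | \
  kubectl apply --context=cluster-1 -f -

# Authorized networks: allow cluster-1's Pod CIDR to reach cluster-2's private control plane
gcloud container clusters update cluster-2 --zone europe-west2-a \
  --enable-master-authorized-networks \
  --master-authorized-networks 10.4.0.0/14
</code></pre>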
<p>A final consideration for private clusters is that they will not, by default, have access to the Internet to download the container images from Docker Hub that we have used in this post. To solve this, you can either manually download and add container images to Google’s Artifact Registry or configure a Cloud NAT connection for Internet access.</p>
<h2 id="heading-using-cloud-service-mesh-outside-google-cloud">Using Cloud Service Mesh Outside Google Cloud</h2>
<p>As I mentioned earlier, Cloud Service Mesh can be installed on your GKE clusters running outside of Google Cloud, on VMWare, in AWS and Azure, and even on bare metal. This is achieved by using Google’s <code>asmcli</code> tool to install an in-cluster control plane that is managed via Cloud Service Mesh. Your clusters must still belong to your GKE Enterprise fleet and will require connectivity to Google Cloud APIs.</p>
<p>The in-cluster service mesh supports either using Google’s <strong>Mesh CA</strong> service or the Istio CA service as a certificate authority. This is used when creating mutual TLS (mTLS) certificates, which I’ll explore in a future post. Mesh CA is a reliable and scalable managed service specifically designed for managing workload mTLS certificates, and it's recommended to use the Mesh CA for your service mesh. You can optionally choose to use an existing Istio CA if you already have one set up, if you require a custom CA, or if there’s another reason not to use Google’s CA service, but this is a complex outlier use case.</p>
<p>With this in mind, let’s look at the steps required to enroll a cluster that exists <em>outside</em> of Google Cloud. These instructions should work for any of the supported platforms I’ve talked about already.</p>
<h3 id="heading-readying-your-cluster-for-service-mesh">Readying your cluster for service mesh</h3>
<p>First, we’ll need to download Google’s <code>asmcli</code> tool. You should do this wherever you connect to your Kubernetes clusters (for example, your local computer). You can download the tool and make it executable with these commands:</p>
<pre><code class="lang-bash">curl https://storage.googleapis.com/csm-artifacts/asm/asmcli_1.20 &gt; asmcli
chmod +x asmcli
</code></pre>
<p>Next, we need to create a cluster role binding so that we have the necessary permissions to create RBAC roles for service mesh. We’ll do this with the <code>kubectl</code> command, so it's important that you have authentication set up for your cluster in your local <code>.kube/config</code> file. How you achieve this will depend on where your cluster is running.</p>
<p>For each cluster context, run the following command to create the necessary permissions, swapping out <code>you@youremail.com</code> for the user account that you identify with when connecting to Kubernetes:</p>
<pre><code class="lang-bash">kubectl create clusterrolebinding cluster-admin-binding \ 
  --clusterrole=cluster-admin \ 
  --user=you@youremail.com
</code></pre>
<p>Now we can run the <code>asmcli validate</code> command. This will check that your project and cluster will support the service mesh’s minimum requirements, and that the necessary permissions and APIs are set up. This command will also download and extract some useful sample manifest files:</p>
<pre><code class="lang-bash">./asmcli validate \ 
  --kubeconfig KUBECONFIG_FILE \
  --fleet_id FLEET_PROJECT_ID \
  --output_dir DIR_PATH \
  --platform multicloud
</code></pre>
<p>To replace the placeholders:</p>
<ul>
<li><p><code>KUBECONFIG_FILE</code> should point to your local <code>.kube/config</code> file for authentication.</p>
</li>
<li><p><code>FLEET_PROJECT_ID</code> refers to the project that owns your GKE fleet.</p>
</li>
<li><p><code>DIR_PATH</code> specifies a local directory where <code>asmcli</code> will write installation files and sample manifests.</p>
</li>
</ul>
<p>If <code>asmcli</code> detects any errors with your environment or clusters that might prevent service mesh from operating, it will report this to you and hopefully provide some suggested fixes. You will most likely see a few warnings that certain things need to be enabled (such as namespaces, cluster labels etc.), but we can actually get the <code>asmcli</code> command to do that for us in a moment.</p>
<h3 id="heading-installing-service-mesh-components">Installing service mesh components</h3>
<p>The next step is to install the service mesh control plane using the <code>asmcli</code> command. This process will register our clusters to the fleet (if they don’t already belong to it), install the service mesh components for the control and data planes, configure the mesh to trust the fleet’s workload identity, and finally set up remote secrets so that multiple clusters in the same fleet can trust each other.</p>
<p>Now that we’ve validated our environment, we can run the following command to install everything we need, substituting the same placeholders as last time:</p>
<pre><code class="lang-bash">./asmcli install \
  --fleet_id FLEET_PROJECT_ID \
  --kubeconfig KUBECONFIG_FILE \
  --output_dir DIR_PATH \
  --platform multicloud \
  --enable_all \
  --ca mesh_ca
</code></pre>
<p>Note the <code>--enable_all</code> option, which will correct any missing configuration that was identified during the validation stage. The <code>--ca</code> option specifies that we want to use Google’s managed Mesh CA service. When the command completes successfully, you will see a helpful note on how to enable sidecar injection.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732879016454/9b034cfb-27ed-4f62-b248-798737f91696.png" alt class="image--center mx-auto" /></p>
<p>The service mesh should now be deployed and running! The last thing to do before we attempt any deployments is to enable the sidecar auto-injection. You should be able to use the command that you captured previously from the output of the <code>asmcli</code> installation. If you don’t have this command to hand, all you need to do is find the revision label on the <code>istiod</code> deployment, which we can do with this command:</p>
<pre><code class="lang-bash">kubectl -n istio-system get pods -l app=istiod --show-labels
</code></pre>
<p>Just look for the <code>istio.io/rev</code> label in the output. You should see something like <code>istio.io/rev=asm-1206-0</code>. You can then use this label to apply to a namespace like this:</p>
<pre><code class="lang-bash">kubectl label namespace my-namespace istio.io/rev=asm-1206-0 --overwrite
</code></pre>
<p>At the time of writing there was one final quirk that you may need to deal with: application monitoring and logging are not enabled by default for clusters outside of Google Cloud. We definitely want to see all the helpful visualisations in the Cloud Service Mesh dashboard, so we need to take steps to switch this functionality on.</p>
<p>An object called <code>stackdriver</code> in the <code>kube-system</code> namespace manages the integration of application logging. We can edit this object in situ and set the value of <code>enableCloudLoggingForApplications</code> to <code>true</code>. This can be done with the <code>kubectl edit</code> command:</p>
<pre><code class="lang-bash">kubectl –n kube-system edit stackdriver stackdriver
</code></pre>
<p>Look for the <code>enableCloudLoggingForApplications</code> option in the spec and set it to <code>true</code>. Hopefully this option will be automated in future releases!</p>
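<p>As a rough sketch, the edited object should end up with something like this in its spec (all other fields are omitted here):</p>
<pre><code class="lang-yaml"># Fragment of the stackdriver object in the kube-system namespace
spec:
  enableCloudLoggingForApplications: true
</code></pre>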
<h3 id="heading-meshing-multiple-clusters-externally">Meshing multiple clusters externally</h3>
<p>Now that we understand how to deploy Cloud Service Mesh to clusters outside of Google Cloud, we can start to build up an external multi-cluster mesh. This solution is technically supported by Google, provided that all clusters exist within the same single environment, whether that is GKE on VMware, bare metal, Azure or AWS.</p>
<p>At a high level, we can mesh multiple clusters together by following these steps:</p>
<ul>
<li><p>Creating an east-west gateway between the clusters</p>
</li>
<li><p>Exposing local services through the gateways</p>
</li>
<li><p>Enabling endpoint discovery</p>
</li>
</ul>
<p>Google’s guide on how to set up this external multi-cluster mesh has a few different dependencies. Be careful, because you’re now at the cutting edge of multi-cluster technology, where things break easily!</p>
<p>Assuming we have two clusters in one of the supported environments, and that we’ve already installed Cloud Service Mesh, let’s walk through how to action these steps and get our clusters talking to each other. We’ll need a handful of files from the Anthos Service Mesh packages git repo we obtained earlier (available from <a target="_blank" href="https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages/tree/">https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages/tree/</a>).</p>
<p>You will also need the <code>istioctl</code> tool installed, which you can obtain from <a target="_blank" href="https://istio.io/latest/docs/setup/getting-started/">https://istio.io/latest/docs/setup/getting-started/</a></p>
<p>Inside the git repo, navigate to the <code>asm/istio/expansion</code> folder, where you’ll find a shell script called <code>gen-eastwest-gateway.sh</code>. This script takes in options about your environment and dynamically generates an <code>IstioOperator</code> object, which is then piped via the <code>istioctl</code> command into your cluster.</p>
<p>Assuming the <code>kubectl</code> context for your first cluster is called <code>cluster-1</code>, run the following command for <strong>GKE on VMware</strong> or <strong>GDVC Bare Metal</strong> clusters:</p>
<pre><code class="lang-bash">gen-eastwest-gateway.sh \
    --revision asm-1213-3 | istioctl --context cluster-1 \
    install -y --<span class="hljs-built_in">set</span> spec.values.global.pilotCertProvider=kubernetes -f -
</code></pre>
<p><em>Or</em> this version of the command if the clusters are running in <strong>Azure</strong> or <strong>AWS</strong>:</p>
<pre><code class="lang-bash">gen-eastwest-gateway.sh \
    --revision asm-1213-3 | istioctl --context cluster-1 \
    install -y --<span class="hljs-built_in">set</span> spec.values.global.pilotCertProvider=istiod -f -
</code></pre>
<p>Repeat the instructions for your other cluster, using the appropriate <code>kubectl</code> context.</p>
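<p>For example, assuming the context for your second cluster is called <code>cluster-2</code>, the <strong>GKE on VMware</strong> variant of the command would look like this:</p>
<pre><code class="lang-bash">./gen-eastwest-gateway.sh \
    --revision asm-1213-3 | istioctl --context cluster-2 \
    install -y --set spec.values.global.pilotCertProvider=kubernetes -f -
</code></pre>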
<p>Next, we’ll create the <code>Gateway</code> object that exposes services across both clusters. In the same git repo directory, you’ll find a manifest called <code>expose-services.yaml</code>. Apply it for each of your cluster contexts, for example:</p>
<pre><code class="lang-bash">kubectl --context cluster-1 apply -n istio-system -f expose-services.yaml
</code></pre>
<p>Finally, we enable endpoint discovery by asking the <code>asmcli</code> tool to create the mesh. Sadly, at the time of writing, the tooling is still quite immature. Outside of Google Cloud, <code>asmcli</code> doesn’t have an easy way to identify clusters. To create the mesh, you must pass in individual <code>kubeconfig</code> files that contain the authentication credentials for each individual cluster. It’s likely that these credentials are all in one big <code>.kube/config</code> file right now, but you’ll need to manually split them out into individual files. Then you can run the command like this:</p>
<pre><code class="lang-bash">./asmcli create-mesh FLEET_PROJECT_ID \
  KUBECONFIG_FILE_1 KUBECONFIG_FILE_2
</code></pre>
<p>Replace <code>FLEET_PROJECT_ID</code> with the Google Cloud project ID that hosts your GKE fleet, and substitute the other parameters with the filenames of the individual cluster credential files you just created. With that done, you should have a multi-cluster mesh set up and ready to go! Feel free to return to the Hello World example from earlier in this post and test it out on your new non-Google fleet. Services should now be able to communicate across clusters via the gateways you configured.</p>
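<p>If you want to sanity-check the endpoint discovery step, the remote secrets created for each cluster typically carry the <code>istio/multiCluster=true</code> label, so a quick way to confirm they exist is something like this:</p>
<pre><code class="lang-bash"># Each cluster should hold a remote secret for the other clusters in the mesh
kubectl --context cluster-1 -n istio-system get secrets -l istio/multiCluster=true
kubectl --context cluster-2 -n istio-system get secrets -l istio/multiCluster=true
</code></pre>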
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732879393433/bd924cb7-e940-456a-920d-deee1ceef580.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-configuring-a-true-hybrid-mesh">Configuring a true hybrid mesh</h3>
<p>A true hybrid mesh would involve creating a single service mesh, including full service discovery, across multiple clusters that exist in <em>different</em> environments. At the time of writing, this capability was in preview for clusters that span GKE on Google Cloud and GKE on VMware or bare metal. If you want to give this a try, see the documentation at <a target="_blank" href="https://cloud.google.com/service-mesh/docs/operate-and-maintain/hybrid-mesh">https://cloud.google.com/service-mesh/docs/operate-and-maintain/hybrid-mesh</a>. Needless to say, unless the offering and tooling have evolved considerably by the time you read this, the solution is not recommended for production environments!</p>
<h2 id="heading-summary">Summary</h2>
<p>We are now two-thirds of the way through our exploration of Cloud Service Mesh and service mesh principles in general. This post may have you thinking that the technologies involved are simply too immature and unstable to risk in production, despite the extra functionality that a service mesh provides. This is often the side effect of a cloud provider like Google Cloud taking a dynamic open-source project (in this case: Istio) and attempting to turn it into a managed service. The technology doesn’t slow down, and while we are usually willing to do some work “under the hood” on our own projects, we tend to have high expectations of fully managed services. It’s likely that Google will continue to refine the offering of Cloud Service Mesh so that one day soon the experience in all environments, or across hybrid deployments, is just as smooth as when you mesh clusters inside Google Cloud.</p>
<p>In the meantime, Cloud Service Mesh inside Google Cloud can be considered quite stable, and running the service on a single external cluster is well supported by the existing tooling. While Google continues to improve its tools, you could consider managing your own Istio deployments for hybrid or multi-cluster scenarios – after all, outside of Google’s customer base, this is how lots of other people are managing meshed container workloads.</p>
<p>In my next planned post I’ll finish our Service Mesh journey by discussing its potential for securing our workloads. Thankfully, this is an area where the technology is mature and works well across any environment once you’ve done the hard work of installing the mesh!</p>
]]></content:encoded></item><item><title><![CDATA[GKE, Istio and Managed Service Mesh]]></title><description><![CDATA[This is the sixth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to supp...]]></description><link>https://timberry.dev/gke-istio-and-managed-service-mesh</link><guid isPermaLink="true">https://timberry.dev/gke-istio-and-managed-service-mesh</guid><category><![CDATA[gke]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[service mesh]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[#istio]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Mon, 18 Nov 2024 11:57:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731923412187/f98dad89-27b0-4eb5-9ed1-617c3bfb5d9e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the sixth post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>In the first part of this series, we’ve focused on the new concepts that GKE Enterprise introduces for managing your clusters such as fleets and configuring features and policies. We’ll now pivot to look more closely at the advanced features that GKE Enterprise provides for managing our workloads. Since the early days of Kubernetes, workloads have typically been deployed following a microservices architecture pattern, deconstructing monoliths into individual services, and then managing and scaling these components separately. This provided benefits of scale and flexibility, but soon enough issues developed around service discovery, security and observability. The concept of a Service Mesh was introduced to address all of these concerns.</p>
<p>But what is a Service Mesh? How do you use one and do you even need one? Once people have mastered the complexity of Kubernetes, introducing a Service Mesh seems like yet another complicated challenge to master. There are arguments on both sides for the necessity of a Service Mesh and the benefits it provides versus the resources it consumes and the complexity that it either solves or introduces. By the end of this post, hopefully you’ll have a good understanding of these issues and know when and where using a Service Mesh is right for you.</p>
<p>So here’s what we’re going to cover today:</p>
<ul>
<li><p>Understanding the concept and design of a Service Mesh</p>
</li>
<li><p>How to enable Service Mesh components in your GKE cluster and fleet</p>
</li>
<li><p>Creating Ingress and Egress gateways with Service Mesh</p>
</li>
<li><p>Using Service Mesh to provide network resilience</p>
</li>
</ul>
<p>We’ll focus on the fundamentals of Service Mesh in this post, and then build on that knowledge in the next two posts as we leverage Service Mesh for multi-cluster networking and security.</p>
<h2 id="heading-what-is-a-service-mesh">What is a Service Mesh?</h2>
<p>To fully explain the concept of a Service Mesh, it helps to consider the interconnected parts of a Kubernetes cluster from the point of view of the <strong>control plane</strong> and the <strong>data plane</strong>.</p>
<ul>
<li><p>A typical cluster may be running many different containers and services at any time, and these workloads, along with the nodes that are hosting them, can be considered the <strong>data plane</strong>.</p>
</li>
<li><p>The <strong>control plane</strong> on the other hand is the brain of the cluster. It comprises the scheduler and controllers that make decisions about what to run and where. Essentially, the control plane tells the data plane what to do.</p>
</li>
</ul>
<p>But the control plane in Kubernetes has limitations. For example, most common patterns for network ingress only provide flexibility at the ingress layer, without leaving much control over traffic or observability inside the cluster. A Service Mesh is an attempt to extend these capabilities by providing an additional control plane, specifically for service networking logic. This new Service Mesh control plane works alongside the traditional control plane, but it can now provide advanced logic to control how workloads operate and how their services connect to each other and the outside world.</p>
<p>Some of the most useful features of a Service Mesh are:</p>
<ul>
<li><p>Controlling inter-service routing</p>
</li>
<li><p>Setting up failure recovery and circuit breaking patterns</p>
</li>
<li><p>Microservice-level traffic observability</p>
</li>
<li><p>Mutual end-to-end service encryption and service identity</p>
</li>
</ul>
<p>Once you’ve learned how to implement these new features, you should have a better idea of where they can be useful and if you want to use them or not for your own workloads.</p>
<h3 id="heading-istio-and-the-sidecar-pattern">Istio and the Sidecar pattern</h3>
<p>The Service Mesh in GKE Enterprise is powered by Istio, one of the most popular open-source Service Mesh projects. Istio provides advanced traffic management, observability and security benefits, and can be applied to a Kubernetes cluster without requiring any manual changes to existing deployments. For more information about Istio, see the website at <a target="_blank" href="https://istio.io/">https://istio.io/</a></p>
<p>Istio deploys controllers into the control plane of your clusters, and gains access to the data plane using a sidecar pattern. This involves deploying an Envoy proxy container that is run as a sidecar container into each workload Pod. This proxy takes over all network communication for that Pod, providing an extension of the data plane. The proxy is then configured by the new Istio control plane. Hence the terminology of a mesh: Istio is overlaying the services in your cluster with its own web of proxies. Here’s a basic illustration of this concept:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731665765615/6d3973ba-2e80-46fc-a345-d01cbd5e7c0b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-deploying-service-mesh-components">Deploying Service Mesh components</h2>
<p>Service Mesh can be installed in a few different ways when using GKE Enterprise. The recommended approach is to enable the managed Service Mesh service, which provides a fully managed control plane for Istio. You can enable Service Mesh from the Feature Manager page in the GKE Enterprise section of the Google Cloud console, as shown below. Once Service Mesh is enabled for your fleet, new clusters registered to it will automatically have Service Mesh installed and managed. You can optionally sync your fleet settings to any existing clusters; however, you will need to enable Workload Identity on these clusters first if you haven’t already.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731666005088/ff86b798-99d4-4a4b-bd1a-e6221667f6d3.png" alt class="image--center mx-auto" /></p>
<p>Using the managed option for Service Mesh reduces the load on your clusters because most of the control plane work is offloaded to the managed service. Enrolled clusters will still need to run the <code>mdp-controller</code> deployment, which requires 50 millicores of CPU and 128Mi of RAM, and the <code>istio-cni-node</code> daemonset on each node, which requires 100 millicores of CPU and 100Mi of RAM per node.</p>
<p>If you have a specific use case outside of the managed service, you can use Google’s <code>asmcli</code> tool to install the Istio Service Mesh directly to your clusters and bypass the managed service. You may choose to do this for example if you need a specific Istio release channel that isn’t aligned to your GKE release channel, or if you need to integrate a private certificate authority. Generally, the managed service is recommended to reduce the complexity involved.</p>
<h3 id="heading-enabling-sidecar-injection">Enabling sidecar injection</h3>
<p>At this stage we’ve just set up the control plane for our Service Mesh, and the next thing to do is to set up the data plane. As we mentioned earlier, the data plane comprises an Envoy proxy sidecar container running inside our application Pods. This proxy takes over all network communication for our applications and communicates with the Service Mesh control plane.</p>
<p>Conveniently, we can enable the automatic injection of sidecars into our workloads. This is done at the namespace level and can be achieved simply by adding a metadata label to a namespace. For example, if we have a namespace called <code>frontend</code>, we can label it for sidecar injection with this command:</p>
<pre><code class="lang-bash">kubectl label namespace frontend istio-injection=enabled
</code></pre>
<p>Any new Pod now created in this namespace will also run the Envoy proxy sidecar. This won’t affect existing Pods unless they are terminated, and then potentially recreated by their controller object (such as a Deployment). A quick way to tell if sidecar injection has worked is to simply look at the output from <code>kubectl get pods</code> and note that you should have one more ready container than you used to (for example, 2/2 containers ready in a Pod).</p>
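<p>For example, to pick up the sidecar in an existing workload and then check the container count (the namespace and deployment name here are just placeholders):</p>
<pre><code class="lang-bash"># Recreate the Pods so the injection webhook runs against them
kubectl -n frontend rollout restart deployment my-app

# Meshed Pods should now show an extra ready container, e.g. 2/2
kubectl -n frontend get pods
</code></pre>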
<p>Alternatively, you can manually inject the sidecar container by modifying the object configuration, such as a Deployment YAML file; the <code>istioctl</code> command line tool can do this for you. For more information see <a target="_blank" href="https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#manual-sidecar-injection">https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#manual-sidecar-injection</a></p>
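<p>As a minimal sketch, manual injection looks something like this, assuming a local manifest called <code>deployment.yaml</code>:</p>
<pre><code class="lang-bash"># Render the manifest with the Envoy sidecar added, then apply the result
istioctl kube-inject -f deployment.yaml | kubectl apply -f -
</code></pre>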
<h3 id="heading-ingress-and-egress-gateways">Ingress and Egress gateways</h3>
<p>The next stage in setting up our Service Mesh is to consider gateways, which manage inbound and outbound traffic. We can optionally configure both Ingress gateways to manage incoming traffic, and Egress gateways to control outbound traffic. Ingress and Egress gateways comprise standalone Envoy proxies that are deployed at the edge of the mesh, rather than attached to workloads.</p>
<p>The benefit of using Service Mesh gateways is that they give us much more control than using low level objects like <code>Service</code> or even the Kubernetes <code>Ingress</code> object. These previous attempts have bundled all configuration logic into a single API object, whereas our Service Mesh lets us separate configuration into a load balancing object – the <code>Gateway</code> – and an application-level object – the <code>VirtualService</code>.</p>
<p>In the OSI network model, load balancing takes place at layers 4-6 and involves things like port configurations and transport layer security. Traffic routing is a layer 7 issue and can now be handled separately by the Service Mesh.</p>
<p>We’ll talk more about <code>VirtualServices</code> later in this post, but for now let’s just set up an Ingress Gateway. The minimum requirement for this will be a deployment of the <code>istio-proxy</code> container, a matching service and the necessary service accounts and RBAC role assignments. For scalability it’s also a good idea to attach a Horizontal Pod Autoscaler to the deployment. Thankfully, Google have done the hard work for us and provided a git repo here: <a target="_blank" href="https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages.git">https://github.com/GoogleCloudPlatform/anthos-service-mesh-packages.git</a> that contains all the necessary object definitions.</p>
<p>Before we go ahead and apply those manifests though, we quickly need to think about namespaces again. A gateway should be considered a user workload, so it should run inside a namespace to which your users, or developers, have access. Depending on the way your teams are set up in your organization, you may wish to have a central dedicated namespace just for the Ingress Gateway, for example <code>gateway-ns</code>, or you may wish to create gateways in the same namespace as the workloads they serve. Either pattern is acceptable; you just need to choose the one that works for the way you manage user access to your clusters.</p>
<p>For now, let’s go ahead and create a dedicated namespace for the gateway:</p>
<pre><code class="lang-bash">kubectl create ns gateway-ns
</code></pre>
<p>Just like we demonstrated earlier, we need to apply a metadata label to this namespace that will enable Istio’s auto-injection:</p>
<pre><code class="lang-bash">kubectl label namespace gateway-ns istio-injection=enabled
</code></pre>
<p>Now we can go ahead and apply the manifests from the git repo to create the Ingress Gateway. Inside the git repo in your local filesystem, change into the <code>samples/gateways</code> directory, and then apply the manifests to our chosen namespace with this command:</p>
<pre><code class="lang-bash">kubectl apply –n gateway-ns –f istio-ingressgateway
</code></pre>
<p>This will apply all of the objects in that directory, including the deployment, the autoscaler, RBAC configuration and even a Pod disruption budget. At this stage this might feel eerily familiar to just deploying a plain old Ingress controller, but we have a few more moving pieces to deploy before it will start to make sense.</p>
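<p>Before moving on, it’s worth confirming that everything landed in the right namespace; something like this should list the gateway deployment, service, autoscaler and disruption budget:</p>
<pre><code class="lang-bash">kubectl -n gateway-ns get deploy,svc,hpa,pdb
</code></pre>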
<h3 id="heading-deploying-sample-microservices">Deploying sample microservices</h3>
<p>Continuing to use Google Cloud’s demo repo, let’s go ahead and deploy the Online Boutique application stack. This is a neat microservices demonstration that will deploy multiple workloads and give us great visualisations in the Service Mesh dashboard later.</p>
<p>From the <code>samples/online-boutique/kubernetes-manifests</code> directory of the git repo in your local filesystem, create all the required namespaces with this command:</p>
<pre><code class="lang-bash">kubectl apply –f namespaces
</code></pre>
<p>We’ll also need to enable sidecar injection on each namespace. We can do this with a handy bash <code>for</code> loop:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> ns <span class="hljs-keyword">in</span> ad cart checkout currency email frontend loadgenerator payment product-catalog recommendation shipping; <span class="hljs-keyword">do</span> 
  kubectl label namespace <span class="hljs-variable">$ns</span> istio-injection=enabled 
<span class="hljs-keyword">done</span>;
</code></pre>
<p>Then we’ll create the service accounts and deployments:</p>
<pre><code class="lang-bash">kubectl apply –f deployments
</code></pre>
<p>And finally, the services:</p>
<pre><code class="lang-bash">kubectl apply –f services
</code></pre>
<p>You should now be able to see that all Pods in the namespaces we’ve created have 2 containers running in them, because the Envoy proxy has been successfully injected. But so far, these are still all the regular Kubernetes objects that we already know about. When do we get to the Istio CRDs?</p>
<h3 id="heading-service-entries">Service Entries</h3>
<p>The demo Online Boutique application makes use of a few external services, such as Google APIs and the Metadata server. When your workloads need to access external network connections, they can of course do this directly. Making a network request to a service that is unknown to the mesh will simply pass through the proxy layer. But what if you could treat external connections as just another hop in your mesh, and benefit from the observability this could give you? That’s the job of the <code>ServiceEntry</code> object, which allows you to define external connections and treat them as services registered to your mesh.</p>
<p>Back in our git repo, move to the <code>samples/online-boutique/istio-manifests</code> directory and take a look at the <code>allow-egress-googleapis.yaml</code> file. In this manifest, service entries are created for <code>allow-egress-googleapis</code> and <code>allow-egress-google-metadata</code>, and both entries specify the hosts and ports required for the connections.</p>
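<p>To give a feel for the shape of these objects, a <code>ServiceEntry</code> for external Google endpoints looks roughly like this (the hosts and ports here are simplified; the file in the repo is the authoritative version):</p>
<pre><code class="lang-yaml">apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: allow-egress-googleapis
spec:
  # External hostnames the mesh should treat as known services
  hosts:
  - "accounts.google.com"
  - "*.googleapis.com"
  ports:
  - number: 443
    name: https
    protocol: HTTPS
</code></pre>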
<p><code>ServiceEntry</code> objects can also be used to add sets of virtual machines to the Istio service registry and can be combined with other Istio objects (which we’ll learn about soon) to control TLS connections, retries, timeouts and more.</p>
<p>We’ll apply this manifest to our cluster, followed by the <code>frontend-gateway.yaml</code> file in the same directory. This will create two more Istio CRD objects, a <code>Gateway</code> and a <code>VirtualService</code>.</p>
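<p>Applying them is just a couple of <code>kubectl apply</code> commands run from that directory (if the manifests don’t set their own namespaces, add a <code>-n</code> flag for the appropriate namespace):</p>
<pre><code class="lang-bash">kubectl apply -f allow-egress-googleapis.yaml
kubectl apply -f frontend-gateway.yaml
</code></pre>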
<p>Now at this stage we’ve deployed all kinds of workloads and objects, so we need to slow down and take a step back to explain how they all work together. Here’s a rough diagram of the request from a user getting all the way to Pods that serve the frontend workload – this is basically the website element of the Online Boutique demo stack:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731685516560/8b1b1ee3-aad5-4758-b42d-c771a7073e6c.png" alt class="image--center mx-auto" /></p>
<p>Firstly, our Google Cloud Load Balancer provides our actual endpoint on the Internet with an anycast IP address. Requests received by the load balancer are routed through the <strong>istio-ingressgateway</strong> <code>Service</code> object to Pods in the <strong>istio-ingressgateway</strong> <code>Deployment</code>. Recall that this is the Ingress Gateway we deployed right at the start.</p>
<p>At the bottom of the diagram is our actual <strong>frontend</strong> <code>Deployment</code>, comprising Pods that will receive traffic from the <strong>frontend</strong> <code>Service</code>. So far, these are standard Kubernetes objects. So, let’s explore the new CRDs in between.</p>
<h3 id="heading-gateway">Gateway</h3>
<p>Although we already created the <strong>istio-ingressgateway</strong> <code>Deployment</code> that actually provides the Ingress Gateway, we haven’t yet configured it. The <code>Gateway</code> CRD object describes the gateway, including a set of ports to expose over which protocols and any other necessary configuration. In the definition used by our example, we specify that port 80 should be exposed for HTTP traffic, and that we accept traffic for all hosts (denoted by the asterisk in this section of the manifest). The <code>istio: ingressgateway</code> selector in the <code>Gateway</code> CRD object tells Istio to apply this configuration to the Pods in our <strong>istio-ingressgateway</strong> <code>Deployment</code>, because if you take a look at that manifest, you’ll see a matching label in its metadata.</p>
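<p>A simplified sketch of a <code>Gateway</code> along those lines looks something like this (trimmed down; the manifest in the repo is the authoritative version):</p>
<pre><code class="lang-yaml">apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: frontend-gateway
  namespace: frontend
spec:
  # Apply this configuration to the ingress gateway Pods we deployed earlier
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    # Accept traffic for any host
    hosts:
    - "*"
</code></pre>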
<p>For more detail about what can be accomplished with the Gateway object see <a target="_blank" href="https://istio.io/latest/docs/reference/config/networking/gateway">https://istio.io/latest/docs/reference/config/networking/gateway</a></p>
<h3 id="heading-virtual-services">Virtual Services</h3>
<p>The <code>VirtualService</code> object configures the routing of traffic once it has arrived through our <code>Gateway</code>. In our example manifest we specify a single HTTP route so it will capture all traffic, and we specify that the “backend” for this traffic is a host called <strong>frontend</strong>. Hosts simply represent where traffic should be sent, and could be IP addresses, DNS names or Kubernetes service names. In this case, we can get away with using a short name like <strong>frontend</strong> because the <code>Service</code> it’s referring to exists in the same namespace as the <code>VirtualService</code>.</p>
<p>The <code>VirtualService</code> object also references the <strong>frontend-gateway</strong> <code>Gateway</code> object we created earlier. This is a common Istio pattern: by referencing the <code>Gateway</code> like this, we apply our <code>VirtualService</code> configuration to it. This is similar to how our <code>Gateway</code> object applied its configuration to our ingress gateway Pods.</p>
<p>The <code>VirtualService</code> CRD allows for some very flexible configuration. Although this example routes all traffic to a single backend, we could choose to route traffic to different hosts based on different routing rules which would be evaluated in sequential order. For example, routing rules can specify criteria such as HTTP headers, URL paths, specific ports or source labels. We can match routing rules based on exact strings, prefixes or regular expressions. And we can even specify policies for HTTP mirroring and fault injection.</p>
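<p>To make this concrete, here’s a simplified sketch of the basic <code>VirtualService</code> described above, routing all traffic that arrives via the gateway to the <strong>frontend</strong> service:</p>
<pre><code class="lang-yaml">apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend-ingress
  namespace: frontend
spec:
  hosts:
  - "*"
  # Attach this routing configuration to the Gateway object we created
  gateways:
  - frontend-gateway
  http:
  - route:
    - destination:
        host: frontend
        port:
          number: 80
</code></pre>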
<p>For more detail about the <code>VirtualService</code> CRD, see <a target="_blank" href="https://istio.io/latest/docs/reference/config/networking/virtual-service/">https://istio.io/latest/docs/reference/config/networking/virtual-service/</a></p>
<h3 id="heading-service-mesh-dashboards">Service Mesh dashboards</h3>
<p>With the Online Boutique sample stack running, we can access the external endpoint provided by the load balancer and browse around a few pages. Just use <code>kubectl</code> to get the service in the <code>gateway-ns</code> namespace and copy its external IP.</p>
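<p>Assuming the gateway service kept its default name, something like this will print the external IP directly:</p>
<pre><code class="lang-bash">kubectl -n gateway-ns get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
</code></pre>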
<p>Back in the Kubernetes section of the Google Cloud console, we can choose <strong>Service Mesh</strong> from the sub-menu, and see a list of services that our mesh knows about, as shown in the screenshot below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731686874693/520a4e42-2ae6-4a78-af12-58307e68c1ab.png" alt class="image--center mx-auto" /></p>
<p>Straight away you should be able to see some of the benefits of using Istio! Because the Envoy proxies capture all network information, from the moment a request enters the ingress gateway and even while traffic traverses between microservices, we suddenly have a much higher level of observability than we would when using traditional Kubernetes objects. We can view the traffic rate for each microservice, along with its error rate and latency. Selecting individual services will also allow us to create Service Level Objectives (SLOs) and alerts for when they are not met. This is a powerful level of observability!</p>
<p>The dashboard also provides a graphical topology view, breaking down how each microservice is connected through its Pods, service and Istio service, as shown below. Part of the magic of the Service Mesh is that we don’t need to tell Istio how all of our microservices are connected together. The mesh topology is automatically generated by Istio simply observing network connections between services, thanks to the Envoy proxies.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731686938665/bd173e36-9773-4cf5-8281-1bd73314aadf.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-advanced-configuration">Advanced Configuration</h2>
<p>The Online Boutique example, while it does contain multiple microservices, is still quite a simple example of deploying a default mesh over existing services. However, using the new CRDs we have available from Istio we could add some advanced configuration that might prove useful. Let’s explore some more of Istio’s capabilities with some theoretical scenarios.</p>
<h3 id="heading-advanced-routing-patterns">Advanced routing patterns</h3>
<p>As we mentioned earlier, the <code>VirtualService</code> object can be used to implement some advanced routing patterns. A useful example of this is inspecting the user-agent HTTP header and routing the request accordingly. We can create conditional routes like this simply by adding a <code>match</code> to the <code>http</code> section of the object:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">http:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">match:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">headers:</span>
        <span class="hljs-attr">user-agent:</span>
          <span class="hljs-attr">regex:</span> <span class="hljs-string">^(.*?;)?(iPhone)(;.*)?$</span>
      <span class="hljs-attr">route:</span>
        <span class="hljs-attr">destination:</span>
          <span class="hljs-attr">host:</span> <span class="hljs-string">frontend-iphone</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">route:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">port:</span>
          <span class="hljs-attr">number:</span> <span class="hljs-number">80</span>
</code></pre>
<p>In this example, a regular expression finds user agents from iPhone users and directs their requests to a different service: <strong>frontend-iphone</strong>. Any other requests will continue to be handled by the frontend service. Any standard or custom HTTP header can be used.</p>
<p>Complex matching rules can be based on other parameters such as URIs and URI schemes, HTTP methods, ports and even query parameters. Once you have defined matching criteria for the different types of requests you want to capture, you specify a destination host. In the Online Boutique example, these are simple internal hostnames matching the <code>Service</code> objects that have been created. However, we can also provide more granular instructions for routing by combining <code>VirtualServices</code> with a <code>DestinationRule</code>.</p>
<h3 id="heading-subsets-and-destination-rules">Subsets and Destination Rules</h3>
<p>The <code>DestinationRule</code> object applies logic to <em>how</em> a request should be routed once the <code>VirtualService</code> has determined <em>where</em> it should be routed. These objects let us configure advanced load balancing and traffic policy configurations, as well as implement patterns for canary and A/B testing.</p>
<p>A common pattern is to define the exact load balancing behaviour you require. For example, to apply the <code>LEAST_REQUEST</code> behaviour to requests for the <strong>frontend</strong> service, we could create the following object:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.istio.io/v1alpha3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">DestinationRule</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-destination</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">trafficPolicy:</span>
    <span class="hljs-attr">loadBalancer:</span>
      <span class="hljs-attr">simple:</span> <span class="hljs-string">LEAST_REQUEST</span>
</code></pre>
<p>With this object applied, endpoints handling the lowest current number of outstanding requests will be favoured by the load balancer. An alternative policy would be <code>ROUND_ROBIN</code>, which will simply round-robin across all endpoints, although this is generally considered unsafe for many scenarios. These configurations can also be applied to a subset of endpoints, and in general, subsets can be used to implement traffic weighting and test patterns.</p>
<p>A common approach to testing changes with production traffic is the canary pattern, where a small subset of traffic is routed to a newer version of a workload to test for any issues. If the service is reliable, the new version of the workload can be promoted. Previously, this pattern could be achieved simply by creating <code>Deployments</code> that were sized to a particular ratio; for example, <code>Deployment</code> A containing 8 Pods and <code>Deployment</code> B containing 2 Pods, then using <code>Deployment</code> B for a canary workload. A <code>Service</code> routing traffic to all Pods would logically hit the canary workload 20% of the time.</p>
<p>However, this low-level approach is not scalable. With Istio CRDs we can define multiple routes in our <code>VirtualService</code> and simply assign a weighting to them. We do this by adding a <code>subset</code> to each destination, along with a <code>weight</code>. In our <code>DestinationRule</code> we then define these subsets, by adding additional label selectors that will match the appropriate Pods.</p>
<p>Here’s an example of an updated <code>VirtualService</code> definition for our frontend. We will now route 5% of all traffic to the canary version of the frontend:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.istio.io/v1alpha3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">VirtualService</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-ingress</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">frontend</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hosts:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"*"</span>
  <span class="hljs-attr">gateways:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">frontend-gateway</span>
  <span class="hljs-attr">http:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">route:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">subset:</span> <span class="hljs-string">prod-frontend</span>
      <span class="hljs-attr">weight:</span> <span class="hljs-number">95</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">subset:</span> <span class="hljs-string">canary-frontend</span>
      <span class="hljs-attr">weight:</span> <span class="hljs-number">5</span>
</code></pre>
<p>We now need to create a matching <code>DestinationRule</code>, which defines additional configuration for the <strong>frontend</strong> host we’re referencing in our route. Inside the <code>DestinationRule</code>, we create the definitions for the subsets that are referenced in the <code>VirtualService</code>. Those subsets simply specify the extra label selectors that should be used.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.istio.io/v1alpha3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">DestinationRule</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-destination</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">trafficPolicy:</span>
    <span class="hljs-attr">loadBalancer:</span>
      <span class="hljs-attr">subsets:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">prod-frontend</span>
        <span class="hljs-attr">labels:</span>
          <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">canary-frontend</span>
        <span class="hljs-attr">labels:</span>
          <span class="hljs-attr">version:</span> <span class="hljs-string">v2</span>
</code></pre>
<p>Putting the objects together, we can see that 95% of the time requests will be routed to Pods that are part of the <strong>frontend</strong> service, but also have the label <code>version:v1</code> in their metadata, while 5% of the time requests will be routed to Pods in that service with <code>version:v2</code> instead.</p>
<h3 id="heading-timeouts-and-retries">Timeouts and Retries</h3>
<p>Our new Service Mesh objects also provide some configuration options that can help with network resilience. For example, within the <code>route</code> section of our <code>VirtualService</code> object, we can specify a maximum timeout time to prevent a request hanging excessively waiting for a response. Returning a timeout error within a more appropriate timeframe can help us to spot errors more quickly, particularly when they are set up with appropriate SLOs and monitoring.</p>
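<p>As a minimal sketch, a per-route timeout is just one extra field alongside the route (the 5 second value here is purely illustrative):</p>
<pre><code class="lang-yaml">http:
  - route:
    - destination:
        host: frontend
    # Give up and return an error if no response arrives within 5 seconds
    timeout: 5s
</code></pre>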
<p>By default, failed requests will be retried by the Envoy proxy. Failures are sometimes due to transient network problems or simply an overloaded service, so with luck a retry may succeed. However, you may wish to tune this behaviour to enhance your overall application performance.</p>
<p>In this example, we add configuration to the <code>route</code> section of our <code>VirtualService</code> to specify that only 3 retries should be attempted, and that each retry should have an individual timeout of 3 seconds:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">http:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">route:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">subset:</span> <span class="hljs-string">prod-frontend</span>
      <span class="hljs-attr">retries:</span>
        <span class="hljs-attr">attempts:</span> <span class="hljs-number">3</span>
        <span class="hljs-attr">perTryTimeout:</span> <span class="hljs-number">3</span>
        <span class="hljs-attr">retryOn:</span> <span class="hljs-string">connect-failure,409</span>
</code></pre>
<p>Here we’re also using <code>retryOn</code> to tell the Envoy proxy specifically that retries should only be attempted if the connection failed or we received an <strong>HTTP 429 Too Many Requests</strong> response.</p>
<h3 id="heading-circuit-breaking">Circuit Breaking</h3>
<p>Circuit breaking is an interesting design pattern for distributed systems that is designed to prevent overloading or cascading failures across services. Much like a circuit breaker in your home will trip and stop electricity flowing to a faulty circuit, this pattern is designed to isolate a problematic system so that requests are no longer sent to it if we know for certain that the service has failed.</p>
<p>In Istio and Envoy, circuit breaking is achieved through the concept of “outlier detection”, in other words, detecting network behaviour that is abnormal. In a <code>DestinationRule</code>, we can define what an outlier behaviour would look like, which will trigger the ejection of Pods from a backend if that behaviour definition is met. Envoy continues to monitor ejected Pods and can automatically add them back into a load balancing pool if they return to normal behaviour.</p>
<p>Here’s an example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.istio.io/v1alpha3</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">DestinationRule</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-destination</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">subsets:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">prod-frontend</span> 
  <span class="hljs-bullet">-</span> <span class="hljs-attr">labels:</span>
      <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
    <span class="hljs-attr">trafficPolicy:</span>
      <span class="hljs-attr">connectionPool:</span>
        <span class="hljs-attr">tcp:</span>
          <span class="hljs-attr">maxConnections:</span> <span class="hljs-number">100</span>
      <span class="hljs-attr">outlierDetection:</span>
        <span class="hljs-attr">consecutive5xxErrors:</span> <span class="hljs-number">3</span>
        <span class="hljs-attr">interval:</span> <span class="hljs-string">1s</span>
        <span class="hljs-attr">baseEjectionTime:</span> <span class="hljs-string">2m</span>
        <span class="hljs-attr">maxEjectionPercent:</span> <span class="hljs-number">50</span>
</code></pre>
<p>In this configuration we have now limited the total number of concurrent connections to version 1 of our frontend to 100 by defining a connection pool within our traffic policy. Then we’ve created some parameters for outlier detection. If the service returns three consecutive HTTP 500 errors within a 1 second interval, the behaviour is considered abnormal, and the circuit breaker will trip. The Pod responsible for the behaviour will be ejected from the load balancing pool; this essentially means Istio will not consider it a candidate to receive requests, even though the Pod itself is not changed. We have specified that the base ejection should be 2 minutes, but the actual ejection time will be multiplied by the number of ejections, providing a kind of exponential backoff. Finally, we have also stated that we can only ever eject half of the Pods in this service.</p>
<h3 id="heading-injecting-faults">Injecting faults</h3>
<p>Fault injection is another interesting tool used to help test the resilience of a microservice architecture stack. If you’ve never done this before, you may think it’s odd to deliberately make services fail some of the time, but it’s better to know how your interdependent services will cope during a testing phase than to find out once everything has gone live! Historically, fault injection has required complex instrumentation software, but once again we can set this up using the all-powerful <code>VirtualService</code> object.</p>
<p>There are two different ways to simulate problems with the <code>VirtualService</code> object, both defined in the <code>fault</code> subsection of the <code>http</code> section. The first is <code>delay</code>, which introduces a delay before a request is forwarded by the preceding Envoy proxy; this can be useful for simulating things like network failures. The second option is <code>abort</code>, which returns error codes downstream, simulating a faulty service.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">http:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">route:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">subset:</span> <span class="hljs-string">frontend-production</span>
    <span class="hljs-attr">fault:</span>
      <span class="hljs-attr">delay:</span>
        <span class="hljs-attr">percentage:</span>
          <span class="hljs-attr">value:</span> <span class="hljs-number">2.0</span>
        <span class="hljs-attr">fixedDelay:</span> <span class="hljs-string">5s</span>
      <span class="hljs-attr">abort:</span>
        <span class="hljs-attr">percentage:</span>
          <span class="hljs-attr">value:</span> <span class="hljs-number">1.0</span>
        <span class="hljs-attr">httpStatus:</span> <span class="hljs-number">503</span>
</code></pre>
<p>In the snippet above, we apply both techniques. A 5 second delay will be added to 2% of requests, while 1 in 100 requests will return an <strong>HTTP 503</strong> error. We could apply this sort of configuration to a staging environment to test how well our other microservices deal with such faults. For example, you might be able to determine how well frontend services cope if some backend services fail or are degraded a percentage of the time.</p>
<h3 id="heading-mirroring-traffic">Mirroring traffic</h3>
<p>A final outstanding feature of Istio is the ability to create a complete mirror of all service traffic, for the purposes of capture and analysis. In the <code>VirtualService</code> object, the <code>mirror</code> section can define a secondary service which will be sent a complete copy of all traffic. This allows you to perform analysis on the mirrored traffic without interrupting any live requests, and your mirror service will never interact with the original requestor.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">http:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">route:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">destination:</span>
        <span class="hljs-attr">host:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">subset:</span> <span class="hljs-string">frontend-production</span>
    <span class="hljs-attr">mirror:</span>
      <span class="hljs-attr">host:</span> <span class="hljs-string">perf-capture</span>
    <span class="hljs-attr">mirrorPercent:</span> <span class="hljs-number">100</span>
</code></pre>
<p>In the above snippet, a copy of 100% of production traffic is also sent to a service called <strong>perf-capture</strong>. This could typically be an instance of a network or performance analysis tool such as Zipkin or Jaeger. Alternatively, traffic mirroring can also be used for testing and canary deployments. This way you can send real traffic to a canary service, without ever exposing it to your end-users.</p>
<h2 id="heading-summary">Summary</h2>
<p>In this post I’ve attempted to demystify the complex world of Service Mesh. As you’ll now appreciate, Istio and its CRDs add a whole new level of complexity to the control plane of our clusters; however, they also provide us with some incredible new powers and abilities in terms of service management and observability. We’ve learned how to define the logic of application and network load balancing through Service Mesh, how to route requests to backend services and subsets of those services, and how to set up advanced routing and load balancing configurations. We’ve also looked at some features that would be extremely difficult to obtain without a Service Mesh, such as circuit breaking and traffic mirroring.</p>
<p>You might now be considering whether Service Mesh is right for your application, your cluster or even your fleet. But don’t decide just yet! In this post we’ve just laid down the foundations of Service Mesh. Over the next couple of posts I’ve got planned I’ll continue to explore its capabilities, first in multi-cluster scenarios, and finally in how it can help us with microservice application security. Then you’ll either be ready to fully embrace the mesh for all your projects, or never switch it on again!</p>
]]></content:encoded></item><item><title><![CDATA[Fleet Ingress options for GKE Enterprise]]></title><description><![CDATA[This is the fifth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to supp...]]></description><link>https://timberry.dev/fleet-ingress-options-for-gke-enterprise</link><guid isPermaLink="true">https://timberry.dev/fleet-ingress-options-for-gke-enterprise</guid><category><![CDATA[gke]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[ingress]]></category><category><![CDATA[multi-cluster-setup]]></category><category><![CDATA[multi-cloud]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Thu, 24 Oct 2024 08:28:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729610326556/521fe43b-a091-47a1-99ee-faa4cfbef091.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the fifth post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>So far in this series we’ve been focussing on how to build various types of clusters, and of course we’ve lightly touched on the concept of <em>fleets</em> while doing so. As we know, fleets are collections of GKE clusters, and we can use this top-level organisational concept to enable features and assign configurations across all clusters in a fleet.</p>
<p>In this post we’ll expand our knowledge of fleets and learn how they are used for advanced patterns of network and cluster isolation, and how we manage ingress across multi-cluster fleets. Fleets are an advanced grouping concept that allows us to manage collections of clusters based on different ideas of separation, such as deployment environment or even business unit. Fleets also allow us to distribute workloads and services easily across multiple clusters for high availability and failover.</p>
<p>In my <a target="_blank" href="https://timberry.dev/configuration-management-in-gke-enterprise">previous post</a>, we looked at an example of enabling configuration and policy management across all clusters in a fleet. But fleets also enable us to scale to multiple clusters while encouraging the normalisation of configuration and resources across clusters inside the same fleet. Google recommends the concept of <em>sameness</em> across all clusters in a fleet, using configuration management to create the same namespace and identity objects. This makes it much easier to deploy workloads that can span clusters in different ways, helping to provide advantages like geo-redundancy and workload isolation. By the end of this post, you’ll be able to identify which of Google’s recommended fleet design patterns would be most appropriate for your organisation!</p>
<p>Specifically, we’re going to learn:</p>
<ul>
<li><p>Examples of Google-recommended fleet designs</p>
</li>
<li><p>The concepts of north-south and east-west routing</p>
</li>
<li><p>Use-cases for deploying workloads to multiple clusters</p>
</li>
<li><p>How to set up multi-cluster services and gateways</p>
</li>
</ul>
<p>Let’s jump in!</p>
<h2 id="heading-multi-cluster-design-patterns">Multi-cluster Design Patterns</h2>
<p>We've already seen fleets in action in previous posts in this series, but let’s take a step back and truly familiarise ourselves with the fundamentals of fleets on GKE Enterprise before continuing any further.</p>
<p>Fleets, as we’ve said, are logical groupings of clusters and other configuration artifacts that belong to those clusters. There is a one-to-one mapping of projects and fleets: a single host project may hold a single fleet. However, as we’ll see in a moment, it is common to maintain multiple host projects and fleets, and clusters in different projects can belong to a fleet in another project.</p>
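<p>As an illustration, registering a GKE cluster that lives in one project to a fleet hosted in a different project can be done with a single <code>gcloud</code> command along these lines (the membership, location, cluster and project names here are hypothetical):</p>
<pre><code class="lang-bash">gcloud container fleet memberships register prod-europe-west1 \
  --gke-cluster=europe-west1/prod-cluster \
  --enable-workload-identity \
  --project=fleet-host-project
</code></pre>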
<p>Features such as configuration management and policy controllers are enabled at the fleet level. While it is possible to be selective about which clusters in the fleet are affected by these features, it’s a recommended best practice to normalise your fleet and apply configurations to all clusters. Clusters can only belong to a single fleet, but by grouping and managing multiple clusters into a fleet, we move the isolation boundary of trust from individual clusters to the level of the fleet itself. This means that there is an implied level of trust between individual clusters, and we must now consider the trust boundary from the point of view of different fleets and projects. If you have followed Google’s best practices for the organizational hierarchy of your cloud resources, it’s likely that you already have similar project-based trust boundaries established.</p>
<p>Let’s look at a multi-cluster example to explore this approach further.</p>
<h3 id="heading-a-multi-cluster-example">A multi-cluster example</h3>
<p>For this fictional example, we will consider an organisation with users in Europe and Asia, with a development team in the USA. The organisation currently deploys production workloads to GKE clusters in the <code>europe-west1</code> and <code>asia-southeast1</code> regions within the same Google Cloud project. Additionally, within that project, a staging GKE cluster is maintained in the <code>us-central1</code> region. Development work takes place on an on-premises GKE cluster in the organisation’s USA office.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729520960158/11a47a8f-f424-4480-81eb-29edec4825fb.png" alt class="image--center mx-auto" /></p>
<p>Within each production cluster we can imagine several namespaces for different applications, or frontend and backend services. When deciding how these clusters should be set up in fleets, we need to consider the concepts of <strong>isolation</strong> and <strong>consistency</strong>. If clusters are consistent, they may belong to the same fleet; however, a degree of isolation is usually required to keep a production environment safe from unwanted changes. Right now, all the clusters belong to the same project and fleet, which may not be the best approach. So let’s look at a few different design patterns to see how we can achieve different levels of isolation and consistency, which should help you choose which one is right for your team.</p>
<h3 id="heading-maximum-isolation-with-multiple-fleets">Maximum isolation with multiple fleets</h3>
<p>Let’s rearrange our clusters across different projects which are separated based on their operating environment: production, staging or development. As I mentioned earlier, this is a common approach to managing Google Cloud projects themselves, not just GKE clusters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521048443/11944672-6089-47fb-ae49-052bf21cb3a2.png" alt class="image--center mx-auto" /></p>
<p>In a high-security environment this gives us the strongest possible isolation between clusters. However, we have now increased our operational complexity because we are managing three different fleets, which means three different sets of configuration management and feature settings. For example, extra effort will be required to maintain consistency across namespaces and policies in the three separate sources of truth. Additionally, development teams will need permissions configured across both the development and staging fleets.</p>
<h3 id="heading-project-isolation-with-a-shared-fleet">Project isolation with a shared fleet</h3>
<p>To simplify management while retaining some resource isolation, let’s modify our design pattern to use a single fleet even if our clusters remain in different projects:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521122306/a5005950-cd02-4996-8ced-1a5079e0ea7c.png" alt class="image--center mx-auto" /></p>
<p>Using a single fleet achieves the maximum level of “sameness” across our clusters. In other words, we are back to using a single source of truth to define configurations, policies and namespaces. However, we haven’t eradicated the complexity completely; we’ve merely shifted it to a different set of tools. Now we’ll need to manage different namespaces for different environments (such as <code>frontend-staging</code> and <code>frontend-prod</code>), and potentially rely on a service mesh to decide which services may communicate across cluster and environment boundaries. This arrangement does, however, guarantee consistency between our development and production environments, which is important when testing application changes before promoting them to production.</p>
<h3 id="heading-production-and-non-production-fleets">Production and non-production fleets</h3>
<p>Our final approach achieves a compromise, combining staging and development clusters into a single fleet to reduce some management complexity, while keeping the hard isolation of a separate fleet for production. This approach additionally allows you to test changes to fleet features and policies before promoting them to production.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521181564/4b0451bc-53be-42dd-b619-4980cff5f1bd.png" alt class="image--center mx-auto" /></p>
<p>These are just a few basic examples of multi-cluster configurations. You may choose one of these or a different variation based on your own team or business requirements. Generally, Google’s recommended best practices suggest that you minimise the differences between production and non-production environments. If developers cannot build and test applications in an environment with parity to production, the chances of successful changes and deployments are reduced. This can be achieved by enforcing configuration management and using GitOps practices to manage the differences between a production and non-production source of truth.</p>
<p>There are other reasons to run multiple clusters besides separating environments. Often, an organization will want to run clusters in multiple locations to bring services closer to their users. Running multiple clusters in different regions also significantly increases availability in the event of downtime in a particular region, which may be a requirement of your organization. Sometimes, data must be held in a specific locality, such as data on European consumers, which may be subject to EU data legislation.</p>
<p>Now, to help us map out use cases for multiple clusters, it's important to understand the conceptual terms of north-south and east-west routing in GKE Enterprise.</p>
<h3 id="heading-north-south-routing">North-South routing</h3>
<p>If you visualise your system in a diagram, think of the outside world at the top of the drawing and all your clusters next to each other below that. North-south routing in this pattern represents traffic or communications coming into your system from outside. Typically, this would be traffic from the Internet received via a Load Balancer, but it could also be communications inside Google Cloud that are still considered outside of your fleet of clusters. The important thing is the communication enters your system via a north-south route.</p>
<h3 id="heading-east-west-routing">East-West routing</h3>
<p>East-west routing by comparison represents communication between your clusters. The traffic has already entered your system, or it may have originated there, but it may be routed to different clusters inside the system based on different variables such as availability and latency. In a moment we’ll look at some basic patterns to help us understand these concepts, and this will lay the foundation for us to explore more advanced east-west patterns when we get to service mesh later in the series.</p>
<h2 id="heading-multi-cluster-services">Multi-cluster Services</h2>
<p>As we’ve already established, there are many use cases for running multiple clusters such as increased availability, capacity or data locality. Now that we’re running multiple clusters within a single fleet, we can start treating the fleet like one big virtual cluster. But how do we enable services and workloads to communicate across the cluster boundaries of our fleet?</p>
<p>We’re of course aware of the humble Kubernetes <code>Service</code> object, which creates a <code>ClusterIP</code> and a logical routing to a set of Pods defined by some label selectors. However, a <code>Service</code> only operates within the confines of a single cluster. To access a <code>Service</code> running on a different cluster within the same fleet in GKE Enterprise, we can configure something called a <strong>Multi-cluster Service</strong> (MCS). This configuration object creates a fleet-wide discovery mechanism, allowing any workload to access a <code>Service</code> anywhere in the fleet. Additionally, MCS objects can reference services running on more than one cluster for high availability.</p>
<p>To create an MCS, you <em>export</em> an existing service from a specific cluster. The MCS controller will then create an <em>import</em> of that service on all other clusters in the fleet. The controller will configure firewall rules between your clusters to allow Pod to Pod communication, set up Cloud DNS records for exported services, and use Google’s Traffic Director as a control plane to keep track of service endpoints.</p>
<p>At the time of writing, MCS objects are only supported on VPC-native GKE clusters running in Google Cloud, where the clusters can communicate either on the same VPC network (including a Shared VPC network) or a peered VPC network. MCS services cannot span clusters outside of a single project. Depending on your use case, these may be significant constraints. However, the MCS object is designed to be a simple approach to extending the standard <code>Service</code> object. As we’ll see in future posts, more advanced patterns can be achieved using a service mesh.</p>
<h3 id="heading-an-example-multi-cluster-service">An example multi-cluster service</h3>
<p>Let’s walk through a basic example where we have a fleet of clusters providing a middleware service for our application stack, in this case, an imaginary service called <code>checkfoo</code>. As illustrated below, our clusters are distributed geographically so that frontend services run as close to our users as possible, but right now the <code>checkfoo</code> service only runs on a single cluster in the <code>europe-west2</code> region. For simplicity, we’ll exclude the actual Pods from these diagrams. All clusters belong to the same fleet.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521527974/9bd21e08-83b4-48c4-b1f6-5bf19411c191.png" alt class="image--center mx-auto" /></p>
<p>Because the <code>checkfoo</code> service runs on a ClusterIP on the <code>prod-eu-1</code> cluster, it is currently only accessible from that cluster. So let’s fix that! The first thing we’ll need to do is enable multi-cluster services on the fleet. This can be done from the Cloud Console, or via the command line:</p>
<pre><code class="lang-bash">gcloud container fleet multi-cluster-services <span class="hljs-built_in">enable</span>
</code></pre>
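<p>Before moving on, it’s worth quickly confirming that the feature is active for the fleet. The describe command below should report the feature state once enabling has finished:</p>
<pre><code class="lang-bash">gcloud container fleet multi-cluster-services describe
</code></pre>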
<p>Next, we’ll need to enable Workload Identity Federation for our clusters. When we set up multi-cluster services, GKE deploys a component to our clusters called the <code>gke-mcs-importer</code>, which needs permissions to read information about your VPC network and incoming traffic.</p>
<p><em>Side note: Workload Identity Federation is a very useful feature, which allows you to map Kubernetes service accounts to IAM service accounts and leverage those service accounts’ permissions for your Kubernetes workloads. You can read more about it here:</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity"><em>https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity</em></a></p>
<p>If you’re using Autopilot clusters, Workload Identity Federation will already be enabled by default. I <em>think</em> it may soon be the default for standard clusters too, but if you’re working with existing clusters, you may have to enable it manually. Here’s an example for our <code>prod-eu-1</code> cluster:</p>
<pre><code class="lang-bash">gcloud container fleet memberships register prod-eu-1 \
   --gke-cluster europe-west2/prod-eu-1 \
   --enable-workload-identity
</code></pre>
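<p>As an aside, if the cluster itself doesn’t have Workload Identity Federation enabled yet, you can also switch it on directly with a cluster update. This is just a sketch using our example names; substitute your own cluster, location and project ID:</p>
<pre><code class="lang-bash"># Enable Workload Identity Federation on an existing Standard cluster
# (prod-eu-1, europe-west2 and my-project-id are our example placeholders)
gcloud container clusters update prod-eu-1 \
    --region europe-west2 \
    --workload-pool=my-project-id.svc.id.goog
</code></pre>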
<p>Once Workload Identity Federation is enabled, we need to grant the required roles to the <code>gke-mcs-importer</code> service account (don’t forget to replace <code>my-project-id</code> in these examples with your own project ID):</p>
<pre><code class="lang-bash">gcloud projects add-iam-policy-binding my-project-id \
    --member <span class="hljs-string">"serviceAccount:my-project-id.svc.id.goog[gke-mcs/gke-mcs-importer]"</span> \
    --role <span class="hljs-string">"roles/compute.networkViewer"</span>
</code></pre>
<p>All that’s left to do now is to create a <code>ServiceExport</code> object on the cluster that is hosting our service. This simple object essentially just flags the target <code>Service</code> object for export and will be picked up by the MCS controller. Note in the example below, our service is running a namespace called <code>foo</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceExport</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">net.gke.io/v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">foo</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">checkfoo</span>
</code></pre>
<p>There is quite a bit of scaffolding for the controller to set up the first time you create a <code>ServiceExport</code>, so it may take up to 20 minutes to synchronise changes to all other clusters in the fleet. Further exports and updates should only take a minute or so.</p>
<p>The process includes creating the matching <code>ServiceImport</code> object on every cluster, allowing the original service to be discovered, as well as setting up Traffic Director endpoints and health checks, plus the necessary firewall rules. Finally, MCS will register a cross-cluster FQDN in Cloud DNS that will resolve from anywhere in the fleet. In this example, the FQDN would be: <code>checkfoo.foo.svc.clusterset.local</code> (combining the service name <code>checkfoo</code>, with the namespace <code>foo</code>).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521820416/daf26eeb-b09d-41c9-9de5-d177031f1f71.png" alt class="image--center mx-auto" /></p>
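<p>If you want to check the result from another cluster in the fleet, a quick sketch like the one below should work. Note the assumptions: <code>prod-eu-2</code> is just our example context name, and port <code>8080</code> is a guess at the port the <code>checkfoo</code> service listens on:</p>
<pre><code class="lang-bash"># Confirm the ServiceImport was created on a consuming cluster
kubectl --context prod-eu-2 get serviceimports -n foo

# Call the exported service via its fleet-wide FQDN from a throwaway Pod
kubectl --context prod-eu-2 -n foo run mcs-test --rm -it --restart=Never \
    --image=busybox:1.36 -- \
    wget -qO- http://checkfoo.foo.svc.clusterset.local:8080
</code></pre>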
<p>Note that at this stage, we could deploy a further <code>checkfoo</code> service to the <code>prod-eu-2</code> cluster as well. Unlike a regular <code>Service</code>, an MCS can also use clusters as selectors, selecting any matching Pod on any cluster where those Pods may be running. The MCS provides a simple east-west internal routing for multi-cluster services, but as we’ll learn next, it provides a backbone for north-south incoming requests as well.</p>
<h2 id="heading-multi-cluster-gateways">Multi-cluster Gateways</h2>
<p>Controlling incoming traffic or network ingress for our Kubernetes clusters has taken many forms as it has evolved over the years. We started by integrating Load Balancer support into the humble <code>Service</code> object, then moved on to more advanced implementations of ingress with <code>Ingress</code> objects and ingress controllers. The current way forward is based on the Kubernetes project’s Gateway API, a more role-oriented, portable and expressive way to define service networking for Kubernetes workloads.</p>
<p>Inside Google Cloud, the <strong>GKE Gateway Controller</strong> implements the Gateway API and provides integration with Google’s cloud Load Balancing services. The Gateway controller is a managed service that runs in your project, but not inside your clusters themselves. It watches for changes in managed clusters, and then deploys and manages the necessary Load Balancing and other network services for you. The Gateway controller has an extensive list of features:</p>
<ul>
<li><p>Support for internal and external load balancing, with HTTP, HTTPS and HTTP/2</p>
</li>
<li><p>Traffic splitting and mirroring</p>
</li>
<li><p>Geography and capacity-based load balancing</p>
</li>
<li><p>Host, path and header-based routing</p>
</li>
<li><p>Support for enabling and managing supporting features such as Google Cloud Armor, Identity-Aware Proxy and Cloud CDN</p>
</li>
</ul>
<p>The Gateway Controller is available as a single-cluster or multi-cluster implementation. Conceptually the two approaches are the same, but we’ll explore the multi-cluster implementation in this post on the assumption that your fleets will be running more than one cluster.</p>
<h2 id="heading-configuring-multi-cluster-gateways">Configuring Multi-cluster Gateways</h2>
<p>If you’re already familiar with the Kubernetes Gateway API, it will help you understand how Google has implemented this API to support a multi-cluster gateway. If this is completely new to you, you might want to review the <a target="_blank" href="https://kubernetes.io/docs/concepts/services-networking/gateway/">documentation here first</a>.</p>
<p>The GKE Gateway Controller provides you with multiple <code>GatewayClasses</code> that can be used to define a gateway. These are:</p>
<ul>
<li><p><code>gke-l7-global-external-managed-mc</code> for global external multi-cluster Gateways</p>
</li>
<li><p><code>gke-l7-regional-external-managed-mc</code> for regional external multi-cluster Gateways</p>
</li>
<li><p><code>gke-l7-rilb-mc</code> for regional internal multi-cluster Gateways</p>
</li>
<li><p><code>gke-l7-gxlb-mc</code> for global external Classic multi-cluster Gateways</p>
</li>
</ul>
<p>You choose a class when you create the <code>Gateway</code> object, as we’ll see in a moment. Just like in our previous example, a typical scenario for a multi-cluster gateway would involve multiple clusters, but these must all be in the same fleet and have Workload Identity Federation enabled.</p>
<p>You’ll need to choose one of your clusters to act as the config cluster. This cluster will host the Gateway API resources (<code>Gateway</code>, <code>Routes</code> and <code>Policies</code>) and control routing across all your clusters. This does, however, make the config cluster a potential single point of failure, because if its API is unavailable, gateway resources cannot be created or updated. For this reason, it's recommended to use regional rather than zonal clusters for high availability. The config cluster doesn’t have to be dedicated – it may also host other workloads – but it does need to contain every namespace that will be used by target clusters, and any user who needs to create ingress services across the fleet will need access to it (although potentially only for their own namespace).</p>
<p>Assuming we already have multi-cluster services enabled in our fleet (as described earlier), we can now enable the multi-cluster gateway by nominating a config cluster, such as the <code>prod-us</code> cluster from our previous example:</p>
<pre><code class="lang-bash">gcloud container fleet ingress <span class="hljs-built_in">enable</span> \
    --config-membership=projects/my-project-id/locations/us-central1/memberships/prod-us
</code></pre>
<p>Next, we need to grant IAM permissions required by the gateway controller. Note that in this command, you will need your project ID <em>and</em> your project number, which you can find in the Cloud Console. In this example, we’re using the fake project number of <code>555123456555</code> just so you can see how this references the relevant service account:</p>
<pre><code class="lang-bash">gcloud projects add-iam-policy-binding my-project-id \ 
    --member <span class="hljs-string">"serviceAccount:service-555123456555@gcp-sa-multiclusteringress.iam.gserviceaccount.com"</span> \ 
    --role <span class="hljs-string">"roles/container.admin"</span>
</code></pre>
<p>We can now confirm that the GKE Gateway Controller is enabled for our fleet, and that all the different <code>GatewayClasses</code> exist in our config cluster, with the following two commands:</p>
<pre><code class="lang-bash">gcloud container fleet ingress describe
kubectl get gatewayclasses
</code></pre>
<p>We’re now ready to configure a gateway! For this example, we’ll deploy a sample app to both of our European clusters, then configure the services to be exported as a multi-cluster service, and finally set up an external multi-cluster Gateway.</p>
<p>We’ll use the <code>gke-l7-global-external-managed-mc</code> gateway class, which provides us with an external application load balancer. When we’ve finished, external requests should be routed to either cluster based on the cluster health and capacity, <em>and</em> the proximity of the cluster to the user making the request. This provides the lowest latency and best possible service for the end-user. The app itself is just a basic web server that tells us where it’s running to help us test this.</p>
<p><strong><em>Another quick aside:</em></strong> <em>This stuff is cutting edge! At the time of writing, the command to enable ingress and set up the controller didn’t exactly have a 100% success rate. If at this stage you don’t see any</em> <code>GatewayClasses</code> <em>in your config cluster, you can force the controller to install with this command:</em></p>
<pre><code class="lang-bash">gcloud container clusters update &lt;cluster-name&gt; --gateway-api=standard --zone=&lt;cluster-zone&gt;
</code></pre>
<p><em>Substitute</em> <code>--region=&lt;cluster-region&gt;</code> <em>for regional clusters. Also, if you see some classes, but not the multi-cluster ones (denoted with the suffix</em> <code>mc</code><em>), try disabling the fleet ingress feature and re-enabling it sometime later (after you have forced the controller to install). Hopefully some of these bugs will get ironed out as the features mature!</em></p>
<h3 id="heading-deploying-the-app-and-setting-up-the-multi-cluster-service">Deploying the app and setting up the multi-cluster service</h3>
<p>Using our existing fleet, we’ll now create a namespace across all our clusters, and deploy a test app to just our EU clusters. Then we’ll set up the <code>Gateway</code> and <code>HTTPRoute</code> on our config cluster, and test ingress traffic from different network locations. Let’s imagine we’re creating something like an e-commerce store, but again we’ll just be using an app that shows us the <code>Pod</code> it’s running from.</p>
<p>First, we'll create the <code>store</code> namespace on all the clusters in the fleet:</p>
<pre><code class="lang-bash">kubectl create ns store
</code></pre>
<p>You need to repeat this for every cluster; you could use different contexts in <code>kubectl</code> or a tool like <code>kubectx</code>.</p>
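<p>If you have several clusters, a small loop over your <code>kubectl</code> contexts saves some typing. The context names below are placeholders for our example fleet; check your own with <code>kubectl config get-contexts</code>:</p>
<pre><code class="lang-bash"># Create the store namespace on every cluster in the fleet
# (context names are examples; substitute your own)
for ctx in prod-eu-1 prod-eu-2 prod-us; do
  kubectl --context "$ctx" create ns store
done
</code></pre>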
<p>Next, we’ll create the <code>store</code> deployment. This is just a basic 2-replica deployment that uses Google’s <code>whereami</code> container to answer requests with details about where it is running. We'll use the following YAML, and we’ll create this deployment on the <code>prod-eu-1</code> and <code>prod-eu-2</code> clusters. (Just write the YAML to a file and use <code>kubectl apply</code>):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">store</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">store</span>
      <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">store</span>
        <span class="hljs-attr">version:</span> <span class="hljs-string">v1</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">whereami</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">us-docker.pkg.dev/google-samples/containers/gke/whereami:v1.2.20</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
</code></pre>
<p>We’ll also create a <code>Service</code> object for each deployment, as well as a <code>ServiceExport</code>, so that the MCS controller can pick it up and create a <code>ServiceImport</code> on every cluster in the fleet. Once again, the YAML required is very simple:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">store</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">store</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">8080</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceExport</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">net.gke.io/v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">store</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
</code></pre>
<p>These objects are applied to the <code>prod-eu-1</code> and <code>prod-eu-2</code> clusters, where the previous deployments were also applied. Unlike our intentions in the first part of this post, we’re not doing this for cluster-to-cluster communication, but to help our <code>Gateway</code> find our services on either cluster. A corresponding <code>ServiceImport</code> will be created on any cluster that does not host the <code>ServiceExport</code> (in this case, <code>prod-us</code>), and <em>this</em> object will be used by our <code>Gateway</code> as a logical identifier for a backend service, pointing it at the other cluster endpoints. Using a <code>ServiceImport</code> as a backend instead of a <code>Service</code> means our target Pods could run on any cluster in the fleet!</p>
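<p>As a quick reference, applying everything and checking the result might look like the sketch below. The file names are whatever you saved the YAML as, and the context names are our example placeholders:</p>
<pre><code class="lang-bash"># Apply the Deployment and the Service/ServiceExport pair to both EU clusters
kubectl --context prod-eu-1 apply -f store-deployment.yaml -f store-service.yaml
kubectl --context prod-eu-2 apply -f store-deployment.yaml -f store-service.yaml

# The MCS controller should then create a matching ServiceImport on prod-us
kubectl --context prod-us get serviceimports -n store
</code></pre>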
<p>Here’s what it all looks like now:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729545117610/c4118868-79cf-49a4-bba6-fe576db7598a.png" alt class="image--center mx-auto" /></p>
<p>With our multi-cluster workload running happily in our fleet, we can now set up the necessary objects to expose it to the outside world. As we’ve already enabled the multi-cluster gateway controller, we can now create a <code>Gateway</code> object which defines how we would like to create and leverage a Load Balancer.</p>
<p>We’ll use the <code>gke-l7-global-external-managed-mc</code> class for our <code>Gateway</code>, which specifies that we want to use the global external HTTP(S) load balancer with multi-cluster support. The separation of duties enabled by the Kubernetes Gateway API means that we create the Load Balancer now with this object, but we can configure HTTP routes (in other words, its URL map) later.</p>
<p>We’ll deploy this configuration object to our config cluster (in our example, <code>prod-us</code>):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">Gateway</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">gateway.networking.k8s.io/v1beta1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">external-http</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">gatewayClassName:</span> <span class="hljs-string">gke-l7-global-external-managed-mc</span>
  <span class="hljs-attr">listeners:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">http</span>
    <span class="hljs-attr">protocol:</span> <span class="hljs-string">HTTP</span>
    <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">allowedRoutes:</span>
      <span class="hljs-attr">kinds:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">HTTPRoute</span>
</code></pre>
<p>Finally, we’ll add an <code>HTTPRoute</code> to the config cluster to configure how we would like the load balancer to route requests to our backend services (or service imports in this case):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">HTTPRoute</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">gateway.networking.k8s.io/v1beta1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">public-store-route</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">store</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">gateway:</span> <span class="hljs-string">external-http</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">hostnames:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"store.example.com"</span>
  <span class="hljs-attr">parentRefs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">external-http</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">backendRefs:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">group:</span> <span class="hljs-string">net.gke.io</span>
      <span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceImport</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">store</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">8080</span>
</code></pre>
<p>This is a very basic example, where all requests to the load balancer that contain the hostname <code>store.example.com</code> in the header will be routed to the <code>ServiceImport</code> called <code>store</code>. Of course, you have the full power of the Kubernetes Gateway API at your disposal, and you could configure your <code>HTTPRoute</code> object to route to multiple different backends based on URL path, HTTP headers or even query parameters. For more details see <a target="_blank" href="https://gateway-api.sigs.k8s.io/api-types/httproute/">https://gateway-api.sigs.k8s.io/api-types/httproute/</a></p>
<p>After a few minutes, the <code>Gateway</code> should be ready. You can confirm its status with this command:</p>
<pre><code class="lang-bash">kubectl -n store describe gateways.gateway.networking.k8s.io external-http
</code></pre>
<p>Now we just need to grab the external IP address to test a connection. We get this with a modified <code>kubectl</code> command:</p>
<pre><code class="lang-bash">kubectl get gateways.gateway.networking.k8s.io external-http -o=jsonpath=<span class="hljs-string">"{.status.addresses[0].value}"</span> --context prod-us --namespace store
</code></pre>
<p>We can use <code>curl</code> from the command line to test the external IP, passing in a header to request the hostname we specified in the <code>HTTPRoute</code>. In this example, replace the sample IP address with the one you got from the previous command:</p>
<pre><code class="lang-bash">curl -H <span class="hljs-string">"host: store.example.com"</span> http://142.250.200.14
</code></pre>
<p>You should see some output similar to the screenshot below, showing a response from the <code>Pod</code> that includes its cluster name, Pod name, and even the Compute Engine instance ID it’s running from. Which region serves your request will depend on your current location (or, if you’re using the Cloud Shell terminal, the location of your Cloud Shell VM).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729591091596/f91ebbc8-2e89-487e-870f-954993e211be.png" alt class="image--center mx-auto" /></p>
<p>We can experiment with the locality-based routing by creating test VMs in different regions. For example, if we create a test VM in <code>europe-west4</code>, we should see our request served by the <code>prod-eu-2</code> cluster:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729591154050/2fc54031-7e25-4a0f-9186-a726f37997b5.png" alt class="image--center mx-auto" /></p>
<p>We have now successfully configured a multi-cluster service running across different regions in our fleet, with a single global HTTPS load balancer directing incoming traffic to the backend that is closest to our users and in a healthy state. Additionally, we know from the power of the <code>HTTPRoute</code> object that we could introduce URL maps and more complex route matching for multiple backend services and service imports. In a later post we’ll look at how to integrate additional Google Cloud services into gateways, such as Cloud Armor for web-application firewall protection, and the Identity-Aware-Proxy (IAP).</p>
<p>The final state of our example fleet, including the flow of requests, is shown below. Note that despite the seemingly complex logic we’ve set up, the actual flow of traffic goes directly from the load balancer frontend to the <code>Pods</code> themselves via Network Endpoint Groups (NEGs):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729591217823/e75a3437-d426-40a7-a878-3a5d066f8067.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-summary">Summary</h2>
<p>In the first few posts in this series we introduced some of the new fundamental concepts that set GKE Enterprise apart from a standard Kubernetes infrastructure deployment. Along the way we’ve gradually built up our knowledge of fleets, and you should now have a good idea of how you might design your own systems based on multiple clusters and potentially multiple fleets.</p>
<p>Of course there are many ways to achieve a desired outcome, and each will have its own tradeoffs. It’s worth sketching out the possibilities before committing to any design, and if you commit to reading future posts in this series (thank you!), feel free to revisit the ideas in this post on how to achieve different levels of isolation or management complexity.</p>
<p>Hopefully you’ve finished this post with an understanding of how to create a truly multi-cluster deployment. While the fully managed Gateway controller dynamically provisions and operates load balancers for us, you may be wondering about its limitation to back-ends that run inside Google Cloud clusters. There are two scenarios where this limitation may impact you:</p>
<ol>
<li><p>If your back-end service runs in an on-premises cluster, but you still want to use a Google Cloud load balancer. According to Google Cloud, this isn’t exactly a preferred design pattern. However, it is still achievable outside of GKE Enterprise services using Network Endpoint Groups with hybrid connectivity. For more information see <a target="_blank" href="https://cloud.google.com/load-balancing/docs/negs/hybrid-neg-concepts">https://cloud.google.com/load-balancing/docs/negs/hybrid-neg-concepts</a>.</p>
</li>
<li><p>If your front-end workloads run in Google Cloud, but they need to communicate with workloads on external clusters.</p>
</li>
</ol>
<p>While this arrangement isn’t supported by a standard MCS, we’ll soon learn that this can be achieved, along with so much more, in the amazing world of service mesh. We’ve got the fundamentals out of the way, so in the next post I can really dig into the nuts and bolts of Istio and use service mesh for traffic routing, service identity, service security and more!</p>
]]></content:encoded></item><item><title><![CDATA[Configuration Management in GKE Enterprise]]></title><description><![CDATA[This is the fourth post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to sup...]]></description><link>https://timberry.dev/configuration-management-in-gke-enterprise</link><guid isPermaLink="true">https://timberry.dev/configuration-management-in-gke-enterprise</guid><category><![CDATA[config controller]]></category><category><![CDATA[config sync]]></category><category><![CDATA[gke]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[configuration management]]></category><category><![CDATA[#policy-controller]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Tue, 08 Oct 2024 10:21:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729609883021/34264141-bc74-4552-acaf-0a50fa001458.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the fourth post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>In this series we have now covered some of the basics of building clusters in GKE Enterprise, and we can begin to explore some of its more useful features, starting in this post with configuration management. Before you drift off to sleep, let me tell you why this is important stuff.</p>
<p>For those of us who have worked in IT for a long time, the concept of configuration management may bring back memories of tools like Puppet and Chef, which were attempts at maintaining the state of systems by declaring their configuration with some sort of code. Their overall success was mixed, but they inspired new ways of thinking about automation and systems management, such as declarative infrastructure and GitOps – both topics we will explore in this post. There are many different reasons to consider implementing a configuration management system, but probably the two most relevant concepts are <strong>automation</strong> and <strong>drift</strong>.</p>
<p>As systems engineers, we always strive to automate manual tasks. If you successfully introduce Kubernetes to your organisation, you may soon find yourself managing multiple clusters for multiple teams, which could mean lots of repetitive work installing, upgrading, fixing, and securing. Wouldn’t it be better to automate this based on a standard template of what clusters should look like? And whether you complete this work manually or via automation, what happens next? Configurations are likely to drift over time due to small or accidental changes or even failures in resources. If the state of your clusters has drifted from what it should be, you’re going to have a much harder time maintaining it.</p>
<p>So, let’s learn about some of the configuration management tools GKE Enterprise gives us to solve these problems. By the end of this post, you will hopefully understand:</p>
<ul>
<li><p>The concepts of declarative infrastructure and GitOps</p>
</li>
<li><p>How to use GKE’s Config Sync to manage cluster state with Git</p>
</li>
<li><p>How the Policy Controller can be used to create guardrails against unwanted behaviour in your clusters</p>
</li>
<li><p>How you can extend configuration management to other Google Cloud resources with the Config Controller</p>
</li>
</ul>
<p>Let’s get started!</p>
<h2 id="heading-understanding-declarative-infrastructure">Understanding Declarative Infrastructure</h2>
<p>The tools and technologies we will learn about in this post are all based on the concept of a <strong>declarative model</strong> of configuration management, so it’s important to understand this fundamental concept. In the history of DevOps, there have generally been two schools of thought about which model should be used in configuration management: <strong>declarative</strong> or <strong>imperative</strong>.</p>
<p>In a nutshell, the declarative model focuses on what you want, but the imperative model focuses more on how you should get it. To clarify that further, a declarative model is based on you declaring a desired state of configuration. If you’re familiar with Kubernetes but came to it years after systems like Puppet and Chef fell out of fashion, then you may never have considered that people did this any other way. Declarative models make the most sense because if the desired state of a system is declared in code, the code itself becomes living documentation of the state of a system. Further, if the software that applies this code is clever enough, it can maintain the state – in other words, guarantee that it remains true to what is declared, and thereby solve the problem of configuration drift.</p>
<p>The imperative model, by focusing on the steps required to reach a desired state, is a weaker approach. The administrator of an imperative system must be sure that each step is taken in the correct order and manage the dependencies and potential failure of each step. Because an imperative model does not declare the end-state of a system, it is harder to use this model to maintain a state once it is built, so it doesn’t help solve the drift problem.</p>
<p><em>A quick side note: You may absolutely disagree with me here 😀 My descriptions of declarative and imperative models are based on my own experience and opinion. You are perfectly entitled to disagree and may have experiences that are very different to my own! For example, some imperative systems claim to have solved the drift problem. For this post’s purposes however, all the tools we will discuss implement a declarative model.</em></p>
<h3 id="heading-infrastructure-as-code">Infrastructure as Code</h3>
<p>While configuration management tools have existed in some form or another for over 20 years now, the concept of Infrastructure as Code emerged as we began to use these tools to manage the lifecycle of virtual machines and other virtualised resources in addition to software and services. The most successful of these tools is arguably Terraform, which has broad support for multiple providers allowing you to take a declarative approach to defining what virtual infrastructure you would like, what other services you need to support the infrastructure, how they should operate together, and ultimately what applications they should run.</p>
<h3 id="heading-gitops">GitOps</h3>
<p>While Terraform can be used to declaratively manage almost anything, these days it is common to pick the best tool for each individual job and combine them with automation. This might mean using Terraform to declare Kubernetes infrastructure resources, then applying Kubernetes objects with <code>kubectl</code>. While we may use different tools to manage different parts of our systems, we still need a common way to automate them together, which brings us to GitOps.</p>
<p>The purpose of GitOps is to standardise a git repository as the single source of truth for all infrastructure as code and configuration management, and then automate the application of different tools as part of a continuous process. When code is committed to a repo, it can be checked and tested as part of a <strong>continuous integration</strong> (CI) process. When changes have passed tests and potentially a manual approval, a <strong>continuous delivery</strong> (CD) service will then invoke the necessary tooling to make the changes. The living documentation of our system now lives in this git repo, which means that all its changes are tracked and can be put through this rigorous testing and approval cycle, improving collaboration, security, and reliability.</p>
<p>The GKE Enterprise approach to configuration management means that artifacts like configuration and security policies now live in a git repo and are automatically synchronised across multiple clusters. This gives you a simplified, centralised approach to managing the configuration and policies of different clusters in different environments without having to worry about configuration drift.</p>
<p>But it’s important to understand where these tools sit within an overall GitOps approach:</p>
<ul>
<li><p>The GKE Enterprise config management tools focus on configuration only (deploying shared configuration objects and policy controllers that we’ll learn about in this post)</p>
</li>
<li><p>They <em>do not</em> automate the creation of clusters (Terraform is recommended for this instead)</p>
</li>
<li><p>They are <em>not recommended</em> for the automation of workloads on clusters. Workloads are expected to be frequently changed and updated, so they lend themselves to different tooling and processes. I’ve got plans to write about CI/CD practices for workloads later in this series.</p>
</li>
</ul>
<p>Now that we’ve clarified this position, let’s start learning about the individual tools themselves, starting with Config Sync.</p>
<h2 id="heading-using-gke-config-sync">Using GKE Config Sync</h2>
<p>The purpose of the <strong>Config Sync</strong> tool is to deploy consistent configurations and policies across multiple clusters, directly from a git repository. This gives us the benefits of using git such as accountability, collaboration, and traceability for changes, but it also guards against configuration drift with options for self-healing and periodic re-syncs. Clusters are enrolled in Config Sync via fleet management, and it is recommended to enroll all clusters in a fleet to ensure a consistent approach across every cluster.</p>
<p>To set up Config Sync, we must first create a git repo containing the desired state of our cluster configurations. When we enable Config Sync, a Kubernetes operator is used to deploy a Config Sync controller to our clusters. The controller will pull the configuration from the git repo, apply it to the cluster, and then continue to watch for any changes so that it can reconcile its current state with the desired state declared in the repo.</p>
<p>By default, these checks happen every hour. If we optionally enable drift prevention, every single request to the Kubernetes API server to modify an object that is managed by the Config Sync controller will be intercepted. The controller will validate that the change will not conflict with the state defined in the repo before allowing the change. Even with drift prevention enabled, the controller will continue its regular re-syncs and self-healing activity.</p>
<h3 id="heading-config-repos">Config repos</h3>
<p>Config Sync supports two different approaches to laying out files in a git repo for configuration management. <strong>Hierarchical</strong> repositories use the filesystem-like structure of the repo itself to determine which namespaces and clusters to apply a configuration to. If you choose the hierarchical option, you must structure your repo this way. An example structure is shown below, where we have configurations that apply to all clusters, and some that are specific to namespaces owned by two different teams.</p>
<pre><code class="lang-bash">.
├── README.md
├── cluster
│   ├── clusterrole-ns-reader.yaml
│   └── clusterrolebinding-ns-reader.yaml
├── namespaces
│   ├── blue-team
│   │   ├── namespace.yaml
│   │   └── network-policy.yaml
│   ├── limits.yaml
│   └── red-team
│       ├── namespace.yaml
│       └── network-policy.yaml
└── system
    └── repo.yaml
</code></pre>
<p>The <code>cluster/</code> directory contains configurations that will apply to entire clusters and are not namespaced. In this example, we create a <code>ClusterRole</code> and a <code>ClusterRoleBinding</code>. We could also optionally use a <code>ClusterSelector</code> object to limit the scope of these to specific clusters.</p>
<p>The <code>namespaces/</code> directory contains configurations that will apply to specific namespaces – although again these could be on any cluster or a scoped subset of clusters. In this example, we define the namespace itself in the <code>namespace.yaml</code> file and then apply a specific network policy with the <code>network-policy.yaml</code> file. We could include any other additional namespace-specific configuration we want, in each namespace’s own sub-directory. Note the name of the sub-directory must match the name of the namespace object that is defined.</p>
<p>In this example, the <code>namespaces/</code> directory also contains a standalone file, <code>limits.yaml</code>, that will be applied to all namespaces. Because this repo is hierarchical, we can take advantage of this sort of inheritance.</p>
<p>Finally, the <code>system/</code> directory contains the configuration for the Config Sync operator itself. A valid structured repository must contain the cluster, namespaces and system directories.</p>
<p>The alternative to a hierarchical repo is <strong>unstructured</strong>, and these are recommended for most users, giving you the most flexibility. In an unstructured repo, you are free to organise your files as you see fit. When you configure Config Sync, you tell it where to look in your repo, which means your repo could theoretically contain other data such as Helm charts. Which configurations are applied where can be determined by the use of <code>ClusterSelector</code> objects, <code>NamespaceSelector</code> objects, or just namespace annotations in metadata.</p>
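<p>To give you a flavour of how scoping works in an unstructured repo, here’s a hedged sketch of a <code>ClusterSelector</code> together with an object that references it via an annotation. The labels, names and the <code>ResourceQuota</code> itself are made up purely for illustration:</p>
<pre><code class="lang-yaml"># Select only clusters labelled environment: prod
kind: ClusterSelector
apiVersion: configmanagement.gke.io/v1
metadata:
  name: selector-prod-clusters
spec:
  selector:
    matchLabels:
      environment: prod
---
# An object carrying this annotation is only applied to matching clusters
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: frontend-prod
  annotations:
    configmanagement.gke.io/cluster-selector: selector-prod-clusters
spec:
  hard:
    pods: "100"
</code></pre>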
<h3 id="heading-enabling-config-sync-on-a-cluster">Enabling Config Sync on a cluster</h3>
<p>Let’s walk through a quick demonstration of adding Config Sync to a GKE cluster. We’ll assume that we have already built a GKE Enterprise autopilot cluster running in Google Cloud, and that the cluster is registered to your project’s fleet (take a look at previous posts in this series if you need help getting set up - <em>and always be mindful of the costs of playing with these features!</em>)</p>
<p>We’ll set up the Config Sync components and get them to apply the demo configurations from Google’s quick start repo, which you can find here: <a target="_blank" href="https://github.com/GoogleCloudPlatform/anthos-config-management-samples/tree/main/config-sync-quickstart">https://github.com/GoogleCloudPlatform/anthos-config-management-samples/tree/main/config-sync-quickstart</a></p>
<p>If we take a quick look through the <code>multirepo/</code> directory of this repo, we’ll find a further <code>namespaces/gamestore</code> directory containing two Kubernetes objects to deploy to the <code>gamestore</code> namespace. There is also a <code>root/</code> directory containing objects that will be deployed to the entire cluster, including cluster roles, role bindings and custom resource definitions. Take a look at the YAML code and familiarize yourself with these objects.</p>
<p>In the GKE Enterprise section of the Google Cloud console, you should be able to find the <strong>Config Sync</strong> section in the left-hand menu, under <strong>Features</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727778775414/1ed3497b-b98e-4dff-9cb4-7c6b14824d37.png" alt class="image--center mx-auto" /></p>
<p>On this page, we can select <strong>Install Config Sync</strong> which will bring up an installation dialog box. Here we can choose which clusters to target for the installation, and which version of Config Sync to use. The recommended course of action is to install Config Sync across all clusters in the fleet. For this demonstration, we’ll just select the cluster we have available and the latest version.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727778824676/d9164133-f6b0-43dd-9617-b9f744156527.png" alt class="image--center mx-auto" /></p>
<p>While the dashboard won’t update straight away, in the background GKE now begins installing Config Sync. After a few minutes, the Config Sync status should show that you have one cluster in a pending state, and after a few more minutes this state should change to Enabled. At this point, Config Sync has installed its operator on your cluster, but it still doesn’t have any actual configuration to synchronise.</p>
<p>Config Sync requires a <strong>source of truth</strong> that it can use to obtain configurations and other objects that it should deploy and manage to your clusters. The most common approach to creating a source of truth is to use a git repository as we have already discussed, but it’s also possible to use OCI container repositories for synchronising container images to your clusters.</p>
<p>Config Sync refers to these sources as <strong>Packages</strong>, so to set one up we will go to the <strong>Packages</strong> section of the Config Sync page and select <strong>Deploy Package</strong>. We will choose the git repository option and select our cluster. Now we need to enter a few details about the package:</p>
<ul>
<li><p><strong>Package name</strong>: This will be the name of the configuration object created to match this package. We’ll use <code>sample-repository</code> for now, as we’re using Google’s sample repo.</p>
</li>
<li><p><strong>Repository URL</strong>: The git repo URL we mentioned earlier, which is: <a target="_blank" href="https://github.com/GoogleCloudPlatform/anthos-config-management-samples">https://github.com/GoogleCloudPlatform/anthos-config-management-samples</a></p>
</li>
<li><p><strong>Path</strong>: We need to point to the correct part of our repo that contains the objects we wish to synchronise, which is <code>config-sync-quickstart/multirepo/root</code></p>
</li>
</ul>
<p>We can now click <strong>Deploy Package</strong> and let Config Sync do its work. Specifically, Config Sync will now look at the objects declared in the repo and make sure that they exist on our cluster. In its current state, these objects have never been created, so Config Sync will reconcile that by creating them. The overall package is defined as a <code>RootSync</code> object, one of the Config Sync custom resource definitions, which holds all of the setup information you just entered. This object gets created in the <code>config-management-system</code> namespace.</p>
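<p>For reference, the <code>RootSync</code> created behind the scenes looks roughly like the sketch below. Treat the field values as a reconstruction of the options we entered rather than the exact object GKE generates, which may differ slightly:</p>
<pre><code class="lang-yaml">apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/GoogleCloudPlatform/anthos-config-management-samples
    dir: config-sync-quickstart/multirepo/root
    branch: main
    auth: none
</code></pre>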
<p>You’ll see from the dashboard that your cluster has entered a <strong>reconciling</strong> synchronisation status. Once everything has been created, and the cluster state matches your declared state in the repo, the status should change to <strong>synced</strong>. The job of Config Sync then becomes to maintain these objects, so that the actual state of the cluster never drifts from what is declared in the repo.</p>
<p>You may have noticed at this point that two packages have synchronised when we only created one. How did that happen? The root part of the package that we synchronised contained 28 resources, and one of these was a <code>RepoSync</code> object which contained a reference to the <code>multirepo/namespaces/gamestore</code> path in our repo, creating the <code>gamestore</code> namespaced resources. This object is represented as an additional package, even though technically the source of truth is the same git repo. We could optionally have used a <code>RepoSync</code> object to point to a completely different repo.</p>
<h3 id="heading-examining-synchronised-objects">Examining synchronised objects</h3>
<p>There are two useful ways we can check on the work that Config Sync has carried out. Assuming we have configured authentication for <code>kubectl</code>, we can query for objects that carry the <code>app.kubernetes.io/managed-by=configmanagement.gke.io</code> label. This will tell us which objects are managed by Config Sync. For example, we can list all namespaces that contain this label:</p>
<pre><code class="lang-bash">kubectl get ns -l app.kubernetes.io/managed-by=configmanagement.gke.io
</code></pre>
<p>You should see the <code>gamestore</code> and <code>monitoring</code> namespaces. Google also provides a command line tool called <code>nomos</code> as part of the Google Cloud CLI. We can use the <code>nomos status</code> command to view the state of synchronisation. For more information about this command, see <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/config-sync/docs/how-to/nomos-command">https://cloud.google.com/kubernetes-engine/enterprise/config-sync/docs/how-to/nomos-command</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727779471225/bf284280-2658-4ee4-9d13-89ba1c05cc67.png" alt class="image--center mx-auto" /></p>
<p>Let’s recap what we’ve just achieved and why it would be useful in a real environment. We’ve set up Config Sync for our cluster, but as we previously stated, it would be normal to enable Config Sync across an entire fleet of clusters. We’ve then deployed a Config Sync package, creating a code-based deployment of all the configuration we want across our clusters.</p>
<p>In this example repository, objects are defined for role-based access control, custom resource definitions, service accounts, namespaces, and monitoring services. These are the sort of supporting services that are normally configured as part of “Day 2” operations, but using Config Sync we’ve entirely automated them. Using git as our source of truth, we can now maintain a single code base that contains all the structural support our cluster needs, and as our fleet grows, Config Sync will ensure that these objects remain deployed and in their desired state across all our clusters no matter where they are running. We’ve seen how Config Sync can make sure that certain desirable objects are deployed to every cluster in your fleet. What other workloads and objects are then deployed on top of this baseline will be up to your developers and other teams who have access to your clusters. So how do you control what gets deployed?</p>
<h2 id="heading-using-the-gke-policy-controller">Using the GKE Policy Controller</h2>
<p>The Policy Controller is another tool that falls under the banner of configuration management for GKE Enterprise, and its job is to apply programmable policies that can act as guardrails. This allows you to set your own best practices for security and compliance and ensure that those policies are being applied uniformly across your entire fleet.</p>
<p>Under the hood, Policy Controller is built on the open-source <strong>Open Policy Agent</strong> (OPA) Gatekeeper project, which you can read more about at <a target="_blank" href="https://open-policy-agent.github.io/gatekeeper/website/docs/">https://open-policy-agent.github.io/gatekeeper/website/docs/</a>. The Policy Controller in GKE Enterprise extends the OPA Gatekeeper project by also providing an observability dashboard, audit logging, custom validation, and a library of pre-built policies for best practice security and compliance. In this way, you can view the GKE Policy Controller as a managed and opinionated version of the OPA Gatekeeper service. If you prefer, you can simply install and configure OPA Gatekeeper on its own. This will give you greater flexibility, but of course it will require more work on your part. If you wish to do this, you could consider automating its installation via Config Sync.</p>
<h3 id="heading-constraints">Constraints</h3>
<p>Policy Controller works at a low level by leveraging the validating admission controller provided by the OPA Gatekeeper project. This means that all requests to the Kubernetes API on a given cluster are intercepted by this admission controller and validated against predefined policies.</p>
<p>These policies are made of <strong>constraint objects</strong>. Each constraint can either actively block non-compliant API requests, or it can audit a request and report a violation. By default, constraints will audit or mutate requests (more on that later), and report violations which you can view in the Policy dashboard.</p>
<p>GKE provides dozens of constraint templates in a library which you can use to build policies that suit your own security and compliance requirements. To give you a feel for the type of constraints these templates can implement, here are just a few examples:</p>
<ul>
<li><p>The <code>AsmSidecarInjection</code> template ensures that the Istio proxy sidecar is always injected into workload Pods.</p>
</li>
<li><p>The <code>K8sAllowedRepos</code> template will only allow container images to be used where the source URL starts with an approved string.</p>
</li>
<li><p><code>K8sContainerLimits</code> and <code>K8sContainerRequests</code> will only allow containers to be deployed that have resource limits and requests set, and only if those limits and requests are within a specified threshold.</p>
</li>
<li><p>The <code>K8sRequiredProbes</code> template requires Pods to have readiness and/or liveness probes defined, or they cannot be scheduled.</p>
</li>
</ul>
<p>These are just a small sample from the available list, but they should give you an idea of the sort of compliance you can enforce on your cluster. The full list is available at: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/policy-controller/docs/latest/reference/constraint-template-library">https://cloud.google.com/kubernetes-engine/enterprise/policy-controller/docs/latest/reference/constraint-template-library</a></p>
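<p>To make this a little more concrete, here is a minimal sketch of a constraint built from the <code>K8sAllowedRepos</code> template. The constraint name and repository prefix are placeholders, not values from Google’s library:</p>
<pre><code class="lang-yaml">apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-image-repos
spec:
  # Report violations without blocking requests
  enforcementAction: dryrun
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    repos:
    - "us-docker.pkg.dev/my-company/"
</code></pre>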
<p>If you have a requirement to constrain an object that is not covered by one of these templates, you can create your own custom constraint template. This should be an outlier case however, as templates to cover most common situations are provided in the Policy Controller library, and writing custom templates requires an advanced knowledge of the OPA Constraint Framework and the Rego language that it uses.</p>
<h3 id="heading-policy-bundles">Policy Bundles</h3>
<p>While you can easily apply constraint templates individually, GKE Enterprise also gives you the option of applying pre-prepared policy bundles. These are sets of policies prepared by Google Cloud to apply best practices or to meet specific compliance or regulatory requirements.</p>
<p>For example, the Center for Internet Security (CIS) maintains a benchmark for hardening and securing Kubernetes. This benchmark contains dozens of requirements to ensure the security of the control plane, worker nodes, policies, and the managed implementation of GKE itself. A CIS GKE Benchmark Policy Bundle is maintained by Google Cloud and made available in GKE Enterprise to allow you to apply these requirements across clusters in your fleet.</p>
<p>Alternatively, other bundles are available that are based on requirements from the National Institute of Standards and Technology (NIST), PCI-DSS requirements from the PCI Security Standards Council, and even a set of hardening requirements from the National Security Agency (NSA).</p>
<h3 id="heading-applying-policy-control-to-a-cluster">Applying Policy Control to a Cluster</h3>
<p>Let’s walk through a simple demonstration of setting up the Policy Controller for a GKE cluster and then applying some constraints from the template library. Just like before, we’ll assume that we have already built a GKE Enterprise autopilot cluster running in Google Cloud, and that the cluster is registered to your project’s fleet.</p>
<p>To install the Policy Controller, go to the <strong>Policy</strong> page in the GKE section of the Google Cloud console. This page provides a dashboard for compliance with industry standards and best practices. With no clusters yet configured, you’ll just see some greyed-out example insights.</p>
<p>Select <strong>Configure Policy Controller</strong>. You will now see an overview of the Policy Controller feature for your fleet. This page will also inform you that some essential policy settings will now be applied to all clusters in your fleet.</p>
<p>You can choose to view or customise these settings by scrolling down and selecting <strong>Customize Fleet Settings</strong>, as shown below. The Policy Bundle pop-up will give you a choice of policy bundles to apply, but by default, the template library is enabled and so is the latest version of the Policy Essentials bundle. This bundle of best practices enforces policies for role-based access control, service accounts, and pod security policies.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727779923247/3e7d011f-0955-402c-b415-304e80eb8310.png" alt class="image--center mx-auto" /></p>
<p>You can cancel this customisation pop-up and select <strong>Configure</strong> at the bottom of the page. You will then have to confirm that you are happy to apply your new Policy settings to your fleet.</p>
<p>It will take several minutes for GKE to install all necessary components to enable the Policy Controller. This includes the <code>gatekeeper-controller-manager</code> deployment in a dedicated namespace which performs the key actions of the OPA Gatekeeper admission controller, such as managing the webhook that intercepts requests to the Kubernetes API. The controller also evaluates policies and enforcement decisions.</p>
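<p>If you want to confirm the installation from the command line, you can check for the Gatekeeper components directly. The namespace name below is the one used by the underlying open-source install, so treat it as an assumption rather than a guarantee:</p>
<pre><code class="lang-bash"># List the Policy Controller (Gatekeeper) deployments
kubectl get deployments -n gatekeeper-system
</code></pre>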
<p>Our Policy dashboard should now show that our cluster is compliant with the Policy Essentials policy bundle that we enabled across our fleet, as shown below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727779986082/0d9c6e2a-7b5c-44ad-bac2-cadd19f9c00e.png" alt class="image--center mx-auto" /></p>
<p>Next, let’s test how the Policy Controller audits and reports on best practices. The Policy Essentials bundle that we already applied is surprisingly permissive, so let’s add a new policy bundle to our fleet that will flag up some more risky behaviour:</p>
<ul>
<li><p>From the <strong>Policy</strong> page, select the <strong>Settings</strong> tab.</p>
</li>
<li><p>Select <strong>Edit Fleet Settings</strong>.</p>
</li>
<li><p>Scroll down and select <strong>Customize Fleet Settings</strong>.</p>
</li>
<li><p>Look for <strong>NSA CISA Kubernetes hardening guide</strong> and select <strong>Enable</strong>.</p>
</li>
<li><p>Click <strong>Save Changes</strong>.</p>
</li>
<li><p>At the bottom of the page, click <strong>Configure</strong>, then <strong>Confirm</strong>.</p>
</li>
</ul>
<p>It can take a few minutes for new fleet settings to be synchronised with your cluster, but if you don’t want to wait, you can click the <strong>Sync with Fleet Settings</strong> button to start the process now. After a few minutes, the additional policy bundle will be synchronised with your cluster, and your dashboard should show that you are now compliant with the NSA Kubernetes standards.</p>
<p>Just like we have done many times now, let’s create a deployment for a Hello World application:</p>
<pre><code class="lang-bash">kubectl create deployment hello-server --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
</code></pre>
<p>You’ll notice that even though the deployment is created successfully, the admission controller has <em>mutated</em> our request. Admission controllers, in addition to validating and potentially declining a request, can also modify it to suit a set of policies. In this case, we did not specify CPU and memory resources for our container, but we have a policy that states that they are required. So the admission controller modified the request for us and added them.</p>
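<p>One quick way to see the mutation for yourself is to query the resources section of the container spec, which we never set ourselves:</p>
<pre><code class="lang-bash">kubectl get deployment hello-server \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
</code></pre>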
<p>Next, let’s create a <code>ClusterIP</code> service and an <code>Ingress</code> to map to it. Don’t worry that we don’t actually have an Ingress Controller to use right now, as we’re not really trying to set up a proper service. We just want to see what the Policy Controller will do.</p>
<pre><code class="lang-bash">kubectl expose deployment hello-server --<span class="hljs-built_in">type</span> ClusterIP --port 80 --target-port 8080
kubectl create ingress hello-ingress --rule=<span class="hljs-string">"example.com/=hello-server:80"</span>
</code></pre>
<p>If you investigated the details of the policy bundle before we applied it, you may be surprised that these commands returned successfully. The NSA policy bundle specifically contains a constraint called <code>nsa-cisa-k8s-v1.2-block-all-ingress</code>, so shouldn’t that previous command have failed?</p>
<p>As we mentioned earlier, policies are applied, by default, in a <em>dry-run</em> mode. If we go back to our Policy dashboard, we can see that the policies are active because we are now showing multiple violations, as shown in the screenshot below. Google’s recommended best practice is to review all policy violations in the dashboard before taking any automated corrective action. This makes sense from the point of view of changing security requirements in existing production environments; you don’t want your security team to make an arbitrary decision, update the fleet policy, and suddenly disable a bunch of production workloads!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727780312505/724d8698-8b1c-44f5-9a23-8c45da980636.png" alt class="image--center mx-auto" /></p>
<p>If you really want to, it is possible to change the enforcement mode of specific policies and policy bundles by patching the constraint objects directly in Kubernetes. Just look for the <code>enforcementAction</code> parameter and change it from <code>dryrun</code> to <code>warn</code> or <code>deny</code>. However, this approach is not recommended.</p>
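<p>For completeness, a sketch of what that patch might look like is shown below. The constraint kind and name are placeholders; list the constraints first to find the real values on your cluster:</p>
<pre><code class="lang-bash"># List every constraint installed by the policy bundles
kubectl get constraints

# Hypothetical: switch a single constraint from dry-run to deny
kubectl patch CONSTRAINT_KIND CONSTRAINT_NAME --type merge \
  -p '{"spec":{"enforcementAction":"deny"}}'
</code></pre>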
<p>We’ve now seen how the Policy Controller can help us monitor the compliance of the clusters in our fleet against a set of security and best-practice principles. Automating and visualizing this cluster intelligence makes the job of maintaining our fleet much easier. We can work with developers and teams to ensure that any violating workloads are fixed and rest easy knowing that our infrastructure is meeting some rigorous standards. These tools leverage the power of Kubernetes’ declarative approach to configuration and make the task of setting up clusters for developers easy and repeatable. Wouldn’t it be great if we could apply the same tooling to managing other resources in Google Cloud?</p>
<h2 id="heading-leveraging-the-config-controller">Leveraging the Config Controller</h2>
<p>The final tool that falls under the umbrella of GKE Config Management is the <strong>Config Controller</strong>. This service extends the declarative GitOps approach of GKE Enterprise and Config Sync to create and manage objects that are Google Cloud resources outside of your GKE clusters, such as storage buckets, Pub/Sub topics, or any other Google Cloud resource you need. Using Config Controller allows you to continue using your familiar Kubernetes-based approach to declaring your resources and gives you a consistent approach if you are already using Config Sync and GitOps processes. You can combine these tools to apply policy guardrails, auditing, and automatic drift detection and repair for all your deployments, inside of Kubernetes or anywhere else in Google Cloud.</p>
<h3 id="heading-choosing-the-right-tools">Choosing the right tools</h3>
<p>Before we jump in to look at an example of the Config Controller in action, let’s just take a step back and consider this tool in the context of other infrastructure as code tools. It’s fair to say that Terraform is the industry’s leading platform-agnostic tool for infrastructure as code. Most organisations with at least a moderate number of developers and a good understanding of DevOps will be using Terraform to deploy infrastructure, including virtualised infrastructure on cloud platforms. For Kubernetes users, this of course will include Kubernetes clusters, either configured manually on virtual machines or by using managed services like GKE, but again, declared and maintained via Terraform.</p>
<p>Once clusters are built, we move on to the right tooling to deploy workloads to those clusters. In this post, we’ve looked at some of the tools that are unique to GKE Enterprise, and later in the series I’m planning to look at other Google Cloud recommended CI/CD best practices for deploying and maintaining workloads. Many teams will continue to use Terraform to manage Kubernetes objects, but many may adopt a templating approach such as using Helm or Kustomize.</p>
<p>Given that these are industry norms (if not standards), how would you choose to deploy additional cloud resources external to Kubernetes at this point? Again, most teams use Terraform if this is their primary tool for declaring the state of their infrastructure. However, it might be that you don’t want your teams to have to context-switch between the Terraform configuration language and Kubernetes APIs. You may have additional tools in the mix, and any increase in complexity normally means a reduction in developer velocity or an increased chance of making mistakes.</p>
<p>So, in answer to this, Google Cloud created an open-source project called the <strong>GCP Config Connector</strong>, which declares a set of custom resource definitions for Google Cloud. Using this add-on for Kubernetes, anyone can now declare resources in Google Cloud using Kubernetes objects.</p>
<p>Now if you don’t want to create or maintain a Kubernetes cluster just to use the Config Connector, you can instead choose to use the <strong>Config Controller</strong>, which is a fully managed and hosted <em>service</em> for the Config Connector. For the rest of this post, we’ll continue to explore the Config Controller as a feature of GKE Enterprise, but bear in mind that similar outcomes can be achieved simply by adding the open-source Config Connector to an existing Kubernetes cluster. The names are very similar and can be confusing, but hopefully this has clarified things for you!</p>
<h3 id="heading-setting-up-the-config-controller">Setting up the Config Controller</h3>
<p>So for our final walkthrough/demo in this post, let’s create a Config Controller instance and use it to deploy a non-Kubernetes Google Cloud resource!</p>
<p>A Config Controller instance is a fully managed service, so we don’t need to configure or create the infrastructure required to run it, we just request the service with this command:</p>
<pre><code class="lang-bash">gcloud anthos config controller create cc-example --location=us-central1 --full-management
</code></pre>
<p>This command can take up to 15 minutes to complete because in the background a new GKE cluster is being provisioned just to run the Config Controller components for you. When the command completes it will also generate a <code>kubeconfig</code> entry for you, just like when you create a GKE cluster from the command line. This will allow you to use <code>kubectl</code> to create your Google Cloud resources.</p>
<p>Wait a minute, why are we running an <em>Anthos</em> command? That name got deprecated surely! Well at the time of writing, the commands we use for Config Controller still reference Anthos. While the product name is now officially deprecated, it can take a long time to rewrite command line tools to reflect this. If the commands don’t work in this section when you try them, please refer to the documentation here: <a target="_blank" href="https://cloud.google.com/sdk/gcloud/reference/anthos/config/controller">https://cloud.google.com/sdk/gcloud/reference/anthos/config/controller</a></p>
<p>Back to our demo, and when the instance has been created, it will show up as a GKE cluster in the cloud console cluster list. You can also confirm its state from the command line:</p>
<pre><code class="lang-bash">gcloud anthos config controller list --location=us-central1
</code></pre>
<p>Before we can ask our new Config Controller to create resources for us, we still need to give it the necessary IAM permissions in our project. As part of setting up the Config Controller instance, a service account for Config Controller has also been created. We will grant this service account the Project Owner IAM role so that it can have full control over project resources for us. Note that this is a rather “broad brush” approach. If you only want to use Config Controller for a subset of your project resources, you may want to consider a different IAM role.</p>
<p>First we’ll get the value of the service account email address and store it in an environment variable that we’ll use in a moment:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> SA_EMAIL=<span class="hljs-string">"<span class="hljs-subst">$(kubectl get ConfigConnectorContext -n config-control -o jsonpath='{.items[0].spec.googleServiceAccount}' 2&gt; /dev/null)</span>"</span>
</code></pre>
<p>Now we’ll use that service account email to add an IAM policy binding. In the following command, replace <code>my-project-id</code> with the ID of your own Google Cloud project:</p>
<pre><code class="lang-bash">gcloud projects add-iam-policy-binding my-project-id \
  --member <span class="hljs-string">"serviceAccount:<span class="hljs-variable">${SA_EMAIL}</span>"</span> \
  --role <span class="hljs-string">"roles/owner"</span> \
  --project my-project-id
</code></pre>
<p>Our Config Controller instance is now authorised to manage resources in our project, and we can go ahead and start using it! Google Cloud resources are created by declaring them as Kubernetes objects, in much the same way as if we were creating Kubernetes resources. We can do this simply by writing the necessary YAML files, or of course using Config Sync or other GitOps-based options.</p>
<p>As a simple test, let’s say we want to enable the Pub/Sub API in our project and create a Pub/Sub topic. If you’re not familiar with Pub/Sub, it’s a serverless distributed messaging service, often used in big data pipelines. We’re not going to do anything with Pub/Sub right now, however, other than to demonstrate how we can set it up with the Config Controller.</p>
<p>So first, we’ll create a YAML declaration that states that the Pub/Sub API should be enabled. We’ll create a file called <code>enable-pubsub.yaml</code> and copy in the code below (make sure to once again replace any occurrences of <code>my-project-id</code> with your own Google Cloud project ID if you’re following along):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">serviceusage.cnrm.cloud.google.com/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pubsub.googleapis.com</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">config-control</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">projectRef:</span>
    <span class="hljs-attr">external:</span> <span class="hljs-string">projects/my-project-id</span>
</code></pre>
<p>You can now apply the YAML declaration with <code>kubectl apply</code>, just as you would with any other Kubernetes object. This creates an object of the <code>Service</code> kind from the <code>serviceusage.cnrm.cloud.google.com</code> API group in the <code>config-control</code> namespace on the Config Controller instance. This object represents the resource configuration you have requested in your project; namely, that the Pub/Sub API should be enabled. If you query the object, you should see that the object’s state represents whether the desired resource has been created, as shown below:</p>
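<p>If you’re following along, the apply and a follow-up query might look like this. The fully-qualified resource name is used to avoid clashing with the core Kubernetes <code>Service</code> type; treat that as my assumption about the simplest way to disambiguate it:</p>
<pre><code class="lang-bash">kubectl apply -f enable-pubsub.yaml
kubectl get services.serviceusage.cnrm.cloud.google.com -n config-control
</code></pre>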
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727787554129/ea386e4c-9e8b-4839-8c42-e01b439e6f51.png" alt class="image--center mx-auto" /></p>
<p>Next, let’s create a Pub/Sub topic by writing a new YAML file called <code>pubsub-topic.yaml</code>. Don’t forget to substitute your own project ID again:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">pubsub.cnrm.cloud.google.com/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PubSubTopic</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">cnrm.cloud.google.com/project-id:</span> <span class="hljs-string">my-project-id</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">example-topic</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">config-control</span>
</code></pre>
<p>Once again you can query the configuration object with <code>kubectl</code>. To confirm the resource has been created, we can also use <code>gcloud</code> to list all Pub/Sub topics:</p>
<pre><code class="lang-bash">gcloud pubsub topics list
</code></pre>
<p>You’ll notice that your topic contains a label identifying that it is managed by the Config Controller.</p>
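<p>For example, describing the topic with <code>gcloud</code> should show a label added by the Config Connector machinery (the exact label key may vary between versions):</p>
<pre><code class="lang-bash">gcloud pubsub topics describe example-topic
</code></pre>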
<h3 id="heading-combining-policy-controller-and-config-controller">Combining Policy Controller and Config Controller</h3>
<p>The Config Controller instance also provides us with a built-in Policy Controller, although at the time of writing only a handful of non-Kubernetes, general Google Cloud resource constraints were available. One example is the <code>GCPStorageLocationConstraintV1</code> constraint, which can be used to restrict the allowed locations for Cloud Storage buckets. Outside of the supported templates, other non-Kubernetes constraints could theoretically be created by applying constraint logic to the configuration objects used by the Config Controller, but this isn’t a solution that’s officially supported by Google.</p>
<p>We can demonstrate a constraint that would restrict all Cloud Storage buckets to being created in the <code>us-central1</code> region by creating a <code>bucket-constraint.yaml</code> file like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">constraints.gatekeeper.sh/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">GCPStorageLocationConstraintV1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">storage-only-in-us-central1</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">match:</span>
    <span class="hljs-attr">kinds:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">apiGroups:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">storage.cnrm.cloud.google.com</span>
      <span class="hljs-attr">kinds:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">StorageBucket</span>
  <span class="hljs-attr">parameters:</span>
    <span class="hljs-attr">locations:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">us-central1</span>
</code></pre>
<p>Once we apply this object to the cluster, we can try to create a definition for a bucket in a location that should not be allowed:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.cnrm.cloud.google.com/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">StorageBucket</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-bucket</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">config-control</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">location:</span> <span class="hljs-string">asia-southeast1</span>
</code></pre>
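<p>Applying both files might look like this; the bucket filename is just an assumption, since we haven’t actually named that file:</p>
<pre><code class="lang-bash">kubectl apply -f bucket-constraint.yaml
# Expect this one to be rejected by the admission webhook
kubectl apply -f bucket.yaml
</code></pre>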
<p>When we try to apply this configuration, we should hopefully receive an error from the API server. The Gatekeeper admission controller has denied the request, because the location was disallowed!</p>
<p>To fully leverage the Config Controller, you may also wish to combine it with Config Sync, allowing you to benefit from all the features of Config Sync we discussed earlier in this post. Conceptually, in addition to applying Config Sync to clusters in our fleet, we also apply it to the Config Controller instance, allowing us to use centralised repositories for configuration and policy for Kubernetes objects and Google Cloud resources.</p>
<p>To set this up, all we need to do is manually create a <code>RootSync</code> object and apply it to the Config Controller instance. A sample YAML declaration might look like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">configsync.gke.io/v1beta1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">RootSync</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">root-sync</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">config-management-system</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">sourceFormat:</span> <span class="hljs-string">unstructured</span>
  <span class="hljs-attr">git:</span>
    <span class="hljs-attr">repo:</span> <span class="hljs-string">https://github.com/mygithub/myrepo</span>
    <span class="hljs-attr">branch:</span> <span class="hljs-string">main</span>
    <span class="hljs-attr">dir:</span> <span class="hljs-string">config-dir</span>
    <span class="hljs-attr">auth:</span> <span class="hljs-string">none</span>
</code></pre>
<p>It’s worth keeping an eye on how Google develops the constraints available for other cloud resources. If there is broad enough support, it could help reduce the overall number of tools that you end up using to secure your environments. Fewer tools means fewer things to go wrong!</p>
<h2 id="heading-summary">Summary</h2>
<p>In this post, I’ve introduced the three main tools used by GKE Enterprise for configuration management: <strong>Config Sync</strong>, <strong>Policy Controller</strong> and <strong>Config Controller</strong>. I started by making the case for declarative configuration management and showed how it can be applied with the modern approach of GitOps. Each of these tools performs a specific function, and I would recommend that you consider which of them is relevant to you and your organisation, and in what combination.</p>
<p>You don’t have to use all of them, or <em>any</em> of them if they don’t solve a relevant problem for you. Of the three tools, Config Controller is probably the least well-known because of the prevalence of Terraform for managing infrastructure, and Helm for managing Kubernetes-based workloads. As always, you should experiment to find the right fit for your team based on your own skillsets and other tools you may be using. Whichever tools you choose, declarative infrastructure as code, a single source of truth, and the automation of drift-reconciliation are recommended best practices.</p>
<p>So what’s next in this series? Exploring some of the fundamentals of GKE Enterprise has meant discussing some of the most niche topics and products I’ve ever dabbled in! But next time I’ll get into some more practical use cases, looking at design patterns for multiple clusters and concepts like north-south and east-west routing. After <em>that</em>, we’ll be ready to embark on an epic Service Mesh quest!</p>
]]></content:encoded></item><item><title><![CDATA[Deploying GKE on Bare Metal and VMWare]]></title><description><![CDATA[This is the third post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to supp...]]></description><link>https://timberry.dev/deploying-gke-on-bare-metal-and-vmware</link><guid isPermaLink="true">https://timberry.dev/deploying-gke-on-bare-metal-and-vmware</guid><category><![CDATA[gke]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[vmware]]></category><category><![CDATA[baremetal]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Tue, 24 Sep 2024 10:45:45 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729609618828/517f3658-38e4-43f5-8066-80b740756cf2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the third post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><strong><em>start there</em></strong></a>.</p>
<p>In my last post of this series I looked at the most common use cases for deploying GKE - running it in the cloud (probably Google Cloud most of the time). Going all in with a single cloud provider was popular in the early boom days of the cloud, but it's now quite common for large enterprises to diversify their IT infrastructure. You may have a requirement to use multiple clouds or, as I’ll cover in this post, your own dedicated physical infrastructure. Use cases for keeping application deployments on-premises or on your own managed infrastructure vary, but they can often be tied to regulatory compliance or the need to guarantee data locality. By using GKE Enterprise, you can keep the same approach to modernisation, the same processes and tools for deployment, and the same level of management and visibility, but have your choice of deployment targets. In this post, I’ll cover how GKE Enterprise enables this by deploying to bare metal, and other customer-managed infrastructure services that may be virtualised (GKE still considers them bare metal if there isn’t a specific cloud solution for them).</p>
<p>I’m going to walk through a demonstration of some principles of bare metal GKE deployments, but I’m going to simulate a physical datacenter using virtual machines in Google’s Compute Engine. There are simply too many different configurations for physical infrastructure for me to try to cover accurately in a blog post, but the concepts should translate to whatever you need to build in your own environments. As I pointed out in previous posts, if you want to try any of this out for yourself, please be mindful of the costs!</p>
<h2 id="heading-building-gke-on-bare-metal-environments">Building GKE on Bare Metal Environments</h2>
<p>GKE Enterprise refers to <em>bare metal</em> as anything running outside of Google Cloud or other supported cloud vendors, but it’s probably not the best way to describe these non-cloud environments, and indeed Google’s own terminology may be updated in due course to reflect this.</p>
<p>Running GKE Enterprise in this way relies on a solution called <strong>Google Distributed Cloud Virtual</strong> (GDCV), which is part of the broader <strong>Google Distributed Cloud</strong> (GDC) service that extends Google Cloud services into alternative data centres or edge appliances. There are even options for GDC that allow you to run a completely isolated and “air-gapped” instance of Google Cloud services on your own hardware! For the purposes of running GKE on bare metal though, we’ll use GDCV to provide connectivity between Google Cloud and other locations such as on-premises data centres.</p>
<p>While many organisations will adopt a cloud-only approach, even if that involves multiple cloud vendors, some organisations may not be ready to <em>fully</em> move to the cloud, or they may have compelling reasons to not do so. This could be due to long-term infrastructure investments in their own data centres, or requirements to work with other types of physical infrastructure. For example, consider industries that work with factories, stadiums, or even oil rigs. Specialised physical infrastructure that is simply not available in the cloud such as bespoke appliances or even mainframes may be in use by some enterprises. An interesting outlier is gaming and gambling companies that are required to generate random numbers in a location that is mandated by law.</p>
<h3 id="heading-hybrid-use-cases">Hybrid use-cases</h3>
<p>Hybrid computing opens up a world of possibilities for organisations with existing on-premises application deployments, or investments in on-premises infrastructure. Some of these use-cases might include:</p>
<ul>
<li><p>Developing and testing applications in a cost-controlled environment on-premises before deploying to a production environment in the cloud</p>
</li>
<li><p>Building front-end and new services in the cloud while keeping enterprise software such as ERP systems on-premises</p>
</li>
<li><p>Storing or processing data in a specific location or under specific controls due to working in a regulated industry</p>
</li>
<li><p>Deploying services to customers in specific areas where no cloud resources are available</p>
</li>
</ul>
<p>Even if you’re deploying to different environments, keeping the process of deploying and managing applications the same will provide developer efficiency, centralised control, and better security. Thankfully, GKE clusters that run in bare metal environments still have most of the capabilities of cloud-based clusters, including integrating with GKE’s management and operations tools, configuration management, and service mesh.</p>
<h3 id="heading-running-gke-on-your-own-servers">Running GKE on your own servers</h3>
<p>Using your own servers – physical or virtual – gives you complete control over the design of your infrastructure, from how your nodes are built to your networking and security configuration. GKE on Bare Metal installs GKE clusters on your own Linux servers and registers them as managed clusters to your GKE Enterprise fleet. And remember, GKE doesn’t care if these really are bare metal, or Linux servers running in a different virtualisation environment.</p>
<p>So, how does this work? With bare metal, there are a few more moving parts than with cloud-based GKE clusters:</p>
<ul>
<li><p><strong>User clusters</strong>: A Kubernetes cluster where workloads are deployed is now called a <em>user cluster</em>. A user cluster consists of at least one control plane node and one worker node. You can have multiple user clusters, which can operate on physical or virtual machines running the Linux operating system.</p>
</li>
<li><p><strong>Admin cluster</strong>: For GKE Bare Metal we also use something called an <em>admin cluster</em>. The admin cluster is a separate Kubernetes control plane that is responsible for the lifecycle of one or more user clusters, and it can create, upgrade, update or delete user clusters by controlling the software that runs on those machines. The admin cluster can also operate on physical or virtual machines running Linux.</p>
</li>
<li><p><strong>Admin workstation</strong>: It is recommended to create a separate machine (virtual or physical) to act as an admin workstation, which will contain all the necessary tools and configuration files to manage the clusters. Just like with regular Kubernetes clusters, you’ll use <code>kubectl</code> as one of these tools to manage workloads on your admin and user clusters. In addition, you’ll use a GKE-specific tool called <code>bmctl</code> to create, update and administer your clusters.</p>
</li>
</ul>
<p>These are the components that differentiate GKE on Bare Metal from its cloud-based counterparts, but there are many different configurations of clusters to choose from based on your requirements for high availability and resource isolation.</p>
<h3 id="heading-choosing-a-deployment-pattern">Choosing a deployment pattern</h3>
<p>High availability (HA) is recommended for production environments, and it can be achieved by using at least three control plane nodes per cluster (including the admin and user clusters), which we can see illustrated below. This configuration is also recommended when you want to manage multiple clusters in the same datacenter from a single centralised place. Using multiple user clusters allows you to isolate workloads between different environments and teams while keeping a centralised control plane to manage all clusters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1726940679007/7b6b02ea-3469-43d0-a234-860d79362c04.png" alt class="image--center mx-auto" /></p>
<p>An alternative to this configuration is a hybrid cluster deployment as shown below. In this pattern, admin clusters serve a dual purpose and may also host user workloads in addition to managing user clusters. This can help to reclaim unused capacity and resources from admin clusters that may otherwise be over-provisioned (particularly in a physical environment, where you may have less flexibility around machine size). However, there are risks to this approach, as the admin cluster holds sensitive data such as SSH and service account keys. If a user workload exploits a vulnerability to access other nodes in the cluster, this data could be compromised.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1726940714369/03ad944b-90a0-4734-afcb-f2c2977c19f0.png" alt class="image--center mx-auto" /></p>
<p>Finally, it is actually possible to run everything you need on a single cluster, essentially taking the hybrid admin cluster from above and removing the separate admin clusters. This can still be HA by having multiple nodes, although everything now runs within a single cluster context which can have the same security risks as the hybrid cluster. The advantage of the standalone cluster, as illustrated below, is that there are significantly fewer resource requirements, which can be helpful in constrained environments or other deployments at the edge.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1726940769490/22b01b08-2259-44ef-a697-faa1ef98f469.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-network-requirements">Network requirements</h3>
<p>The networking and connectivity requirements for GKE on Bare Metal can be quite complex and extensive, depending on how your local physical networks are configured, or how you design networks in your own datacenter. I can’t cover all the possible different ways to provide networking for GKE in this post - it would never end! But at a high level you need to consider 3 primary requirements:</p>
<ul>
<li><p>Layer 2 and layer 3 networking for cluster nodes</p>
</li>
<li><p>Connectivity to your Google Cloud VPC</p>
</li>
<li><p>How external users will access your workloads</p>
</li>
</ul>
<p>As we know, GKE on Bare Metal runs on physical or virtual machines that you provide; in other words, these machines already exist and will not be provisioned by GKE. They will therefore already be connected to your own network. GKE on Bare Metal supports layer 2 or layer 3 connectivity between cluster nodes. Remember from the OSI model that layer 2 refers to the data link layer, with physical network cards communicating across a single network segment, usually via ethernet connections. Layer 3 is the network layer, which uses the Internet Protocol (IP) to traverse multiple network segments.</p>
<p>There are lots of pros and cons for both approaches, but your local network design will usually determine which option you use. At a high level, layer 2 is sometimes more suitable for smaller clusters with a focus on simplicity or a requirement to bundle the MetalLB load balancer (more on that in a moment!), while layer 3 is recommended for larger deployments where scalability and security are paramount.</p>
<p>Your Kubernetes nodes will each require their own CIDR block so that they can assign unique IP addresses to Pods, and the size of that block will affect the number of Pods that can be scheduled. The minimum size of the CIDR block is a <code>/26</code> which allows for 64 IP addresses and 32 Pods per node. The maximum CIDR block size is a <code>/23</code> which allows for 512 IP addresses and the maximum of 250 Pods per node. Again, your local network design will impact this configuration.</p>
<p>Your local nodes will require connectivity to Google Cloud so that they can communicate with the Connect Agent that will facilitate communication between the GKE service and your local Kubernetes API server. You can set up this connectivity across the public Internet using HTTPS, over a VPN connection, or via a Dedicated Interconnect. Google has a helpful article on choosing a network connectivity product here: <a target="_blank" href="https://cloud.google.com/network-connectivity/docs/how-to/choose-product">https://cloud.google.com/network-connectivity/docs/how-to/choose-product</a></p>
<p>Finally, you will need to decide how your external users will connect to your applications hosted on bare metal nodes, if indeed that is a requirement of your clusters. You have a few different options to achieve this, which once again may depend on your local datacenter configuration and your individual requirements.</p>
<p>GKE on Bare Metal can bundle two types of load balancers as part of its deployment. The popular MetalLB bare metal load balancer can be completely managed by GKE for you, providing your cluster nodes run on a layer 2 network. MetalLB load balancers will run on either the control plane nodes of your cluster, or a subset of your worker nodes, and provide virtual IPs that are routable through your layer 2 subnet gateway. Services configured with the <code>LoadBalancer</code> type will then be provided with a virtual IP.</p>
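<p>As a reminder of what that looks like in practice, here is a minimal sketch of a <code>LoadBalancer</code> Service; with the bundled MetalLB, it would be assigned a virtual IP from the configured address pool. The selector label is just an assumption for illustration:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: hello-server
spec:
  type: LoadBalancer
  selector:
    app: hello-server
  ports:
  - port: 80
    targetPort: 8080
</code></pre>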
<p>Alternatively, GKE on Bare Metal supports a bundled load balancer that uses Border Gateway Protocol (BGP) for nodes on layer 3 networks. This approach provides much more flexibility, but it can be complex to configure. Once again Google provides a complete guide here: <a target="_blank" href="https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/how-to/lb-bundled-bgp">https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/how-to/lb-bundled-bgp</a> but I recommend seeking the help of some experienced network engineers to set this up.</p>
<p>Bundled load balancers are just one way to go, however! If your datacenter already runs a load balancing service, such as those provided by F5, Citrix or Cisco, these can also be leveraged. It’s even possible to use cloud load balancers with bare metal servers; however, for this to work you must run the ingress service on a cloud-hosted cluster, which then redirects traffic to the on-premises cluster using a multi-cluster ingress. And yes, I’m going to write a post about ingress very soon!</p>
<h3 id="heading-building-some-bare-metal-infrastructure">Building some bare metal infrastructure</h3>
<p>With the network design out of the way, it’s time to start building our clusters. As mentioned, we’ll be simulating a bare metal environment using Compute Engine just so we can demonstrate the capabilities of GKE on Bare Metal. The tasks we’ll carry out to set up GKE will be almost identical in a real on-premises environment, you’d just be building the servers for real, or using some other kind of infrastructure service. Just like in my last post, you can find the commands below collected into scripts in my Github repo at <a target="_blank" href="https://github.com/timhberry/gke-enterprise/tree/main/bare-metal-demo-scripts">https://github.com/timhberry/gke-enterprise/tree/main/bare-metal-demo-scripts</a></p>
<p>For reference, the scripts are:</p>
<ul>
<li><p><code>bm-vpc.sh</code> - Creates a simulated bare metal network using a Google Cloud VPC</p>
</li>
<li><p><code>create-servers.sh</code> - Creates and configures multiple virtual machines to act as our bare metal infrastructure</p>
</li>
<li><p><code>admin-ws-setup.sh</code> - Configures the admin workstation</p>
</li>
<li><p><code>admin-cluster.sh</code> - Configures the admin cluster (and must be run from the admin workstation)</p>
</li>
<li><p><code>user-cluster.sh</code> - Configures the user cluster (and must be run from the admin workstation)</p>
</li>
</ul>
<p><strong>One final caveat!</strong> I just want to reiterate that below I’m showing you how to build a demo environment using virtual machines in Compute Engine just so we can understand the tooling and the necessary steps to do this on bare metal. So again you’ll need a Google Cloud project with billing enabled, plus the <code>gcloud</code> tool installed and configured. Please be mindful of the costs you’re about to run up, and <em>don’t run GKE for Bare Metal on Compute Engine in production!</em></p>
<p>Okay, let’s start by creating a VPC for our simulated bare metal environment. We’ll call it <code>baremetal</code> and specify that we’re going to use custom, not auto-assigned subnets. These commands can all be found in the <code>bm-vpc.sh</code> script:</p>
<pre><code class="lang-bash">gcloud compute networks create baremetal \
  --subnet-mode=custom \
  --mtu=1460 \
  --bgp-routing-mode=regional
</code></pre>
<p>Next we’ll create a subnetwork in the <code>us-central1</code> region with a CIDR block of <code>10.1.0.0/24</code>:</p>
<pre><code class="lang-bash">gcloud compute networks subnets create us-central1-subnet \
  --range=10.1.0.0/24 \
  --stack-type=IPV4_ONLY \
  --network=baremetal \
  --region=us-central1
</code></pre>
<p>We’ll create some firewall rules now. First, a rule to allow incoming SSH connections from Google’s Identity Aware Proxy (IAP):</p>
<pre><code class="lang-bash">gcloud compute firewall-rules create iap \
  --direction=INGRESS \
  --priority=1000 \
  --network=baremetal \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=35.235.240.0/20
</code></pre>
<p>And then a rule to enable VXLAN traffic for our nodes:</p>
<pre><code class="lang-bash">gcloud compute firewall-rules create vxlan \
  --direction=INGRESS \
  --priority=1000 \
  --network=baremetal \
  --action=ALLOW \
  --rules=udp:4789 \
  --source-tags=vxlan
</code></pre>
<p>What is this mysterious VXLAN we speak of? Well, even though Google Cloud uses a software-defined network (SDN), we can still simulate a layer 2 network for our bare metal demonstration by using Virtual Extensible LAN (VXLAN) which encapsulates ethernet frames on the layer 3 network. If we were deploying servers on a real layer 2 network, we would not need to perform this step.</p>
<p><em>A quick side note: In my tests, the VXLAN package and Google Cloud’s SDN did not always play nice. Sometimes these firewall rules were sufficient, sometimes they weren’t. If you get stuck, you can create an “allow all” rule for your pretend bare metal VPC. This is obviously not a secure solution, but this is just for demonstration purposes anyway!</em></p>
<p>Getting back to our demo, we need to create a service account that we’ll use later for our admin workstation:</p>
<pre><code class="lang-bash">PROJECT_ID=$(gcloud config get-value project)
gcloud iam service-accounts create bm-owner
gcloud projects add-iam-policy-binding <span class="hljs-variable">${PROJECT_ID}</span> \
  --member=serviceAccount:bm-owner@<span class="hljs-variable">${PROJECT_ID}</span>.iam.gserviceaccount.com \
  --role=roles/owner
</code></pre>
<p>Time to create the nodes for our clusters! In this demonstration, we’ll build an admin cluster with a control plane but no worker nodes, and a user cluster with a control plane and one worker node. We won’t go for high availability as this is just a demo, but you can theorise how we could add high availability to this environment by increasing the number of nodes at each level. How you ensure there are no single points of failure between multiple nodes depends heavily on how your physical infrastructure is built. These commands can all be found in the <code>create-servers.sh</code> script.</p>
<p>First we’ll create a bash array to hold our server names, and an empty array that will store IP addresses for us:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">declare</span> -a VMs=(<span class="hljs-string">"admin-workstation"</span> <span class="hljs-string">"admin-control"</span> <span class="hljs-string">"user-control"</span> <span class="hljs-string">"user-worker"</span>)
<span class="hljs-built_in">declare</span> -a IPs=()
</code></pre>
<p>Now we’ll run a <code>for</code> loop to actually create these servers. To keep things simple, each server is identical. Every time we create a VM, we grab its internal IP and add it to the array:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> vm <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${VMs[@]}</span>"</span>
 <span class="hljs-keyword">do</span>
     gcloud compute instances create <span class="hljs-variable">$vm</span> \
         --image-family=ubuntu-2004-lts \
         --image-project=ubuntu-os-cloud \
         --zone=us-central1<span class="hljs-_">-a</span> \
         --boot-disk-size 128G \
         --boot-disk-type pd-standard \
         --can-ip-forward \
         --network baremetal \
         --subnet us-central1-subnet \
         --scopes cloud-platform \
         --machine-type e2-standard-4 \
         --metadata=os-login=FALSE \
         --verbosity=error
     IP=$(gcloud compute instances describe <span class="hljs-variable">$vm</span> --zone us-central1<span class="hljs-_">-a</span> \
         --format=<span class="hljs-string">'get(networkInterfaces[0].networkIP)'</span>)
     IPs+=(<span class="hljs-string">"<span class="hljs-variable">$IP</span>"</span>)
<span class="hljs-keyword">done</span>
</code></pre>
<p>Now we need to add some network tags to each instance which will be used by firewall rules we will set up in a moment. Tags are also used to identify which cluster a node belongs to, whether it provides a control plane or a worker, and additionally whether it will double up as a load balancer (which the control planes do). All servers are tagged with the <code>vxlan</code> tag, as we’re about to set up the VXLAN functionality:</p>
<pre><code class="lang-bash">gcloud compute instances add-tags admin-control \
  --zone us-central1<span class="hljs-_">-a</span> \
  --tags=<span class="hljs-string">"cp,admin,lb,vxlan"</span>
gcloud compute instances add-tags user-control \
  --zone us-central1<span class="hljs-_">-a</span> \
  --tags=<span class="hljs-string">"cp,user,lb,vxlan"</span>
gcloud compute instances add-tags user-worker \
  --zone us-central1<span class="hljs-_">-a</span> \
  --tags=<span class="hljs-string">"worker,user,vxlan"</span>
</code></pre>
<p>For VXLAN to work, we also need to disable the default Ubuntu firewall on each server:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> vm <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${VMs[@]}</span>"</span>
<span class="hljs-keyword">do</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Disabling UFW on <span class="hljs-variable">$vm</span>"</span>
    gcloud compute ssh root@<span class="hljs-variable">$vm</span> --zone us-central1<span class="hljs-_">-a</span> --tunnel-through-iap  &lt;&lt; EOF
        sudo ufw <span class="hljs-built_in">disable</span>
EOF
<span class="hljs-keyword">done</span>
</code></pre>
<p>That <code>for</code> loop looked really simple, didn’t it? Great, because the next one is a lot more complicated 😩 We’ll need to loop through all the VMs and create a VXLAN configuration locally. Doing this assigns an IP address in the <code>10.200.0.x</code> range for the encapsulated layer 2 network:</p>
<pre><code class="lang-bash">i=2
<span class="hljs-keyword">for</span> vm <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${VMs[@]}</span>"</span>
<span class="hljs-keyword">do</span>
    gcloud compute ssh root@<span class="hljs-variable">$vm</span> --zone us-central1<span class="hljs-_">-a</span> --tunnel-through-iap &lt;&lt; EOF
        <span class="hljs-comment"># update package list on VM</span>
        apt-get -qq update &gt; /dev/null
        apt-get -qq install -y jq &gt; /dev/null

        <span class="hljs-comment"># print executed commands to terminal</span>
        <span class="hljs-built_in">set</span> -x

        <span class="hljs-comment"># create new vxlan configuration</span>
        ip link add vxlan0 <span class="hljs-built_in">type</span> vxlan id 42 dev ens4 dstport 4789
        current_ip=\$(ip --json a show dev ens4 | jq <span class="hljs-string">'.[0].addr_info[0].local'</span> -r)
        <span class="hljs-built_in">echo</span> <span class="hljs-string">"VM IP address is: \$current_ip"</span>
        <span class="hljs-keyword">for</span> ip <span class="hljs-keyword">in</span> <span class="hljs-variable">${IPs[@]}</span>; <span class="hljs-keyword">do</span>
            <span class="hljs-keyword">if</span> [ <span class="hljs-string">"\$ip"</span> != <span class="hljs-string">"\$current_ip"</span> ]; <span class="hljs-keyword">then</span>
                bridge fdb append to 00:00:00:00:00:00 dst \<span class="hljs-variable">$ip</span> dev vxlan0
            <span class="hljs-keyword">fi</span>
        <span class="hljs-keyword">done</span>
        ip addr add 10.200.0.<span class="hljs-variable">$i</span>/24 dev vxlan0
        ip link <span class="hljs-built_in">set</span> up dev vxlan0
EOF
    i=$((i+<span class="hljs-number">1</span>))
<span class="hljs-keyword">done</span>
</code></pre>
<p>Once that’s done, we’ll loop through the VMs again and make sure the VXLAN IPs are working:</p>
<pre><code class="lang-bash">i=2
<span class="hljs-keyword">for</span> vm <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${VMs[@]}</span>"</span>;
<span class="hljs-keyword">do</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-variable">$vm</span>;
    gcloud compute ssh root@<span class="hljs-variable">$vm</span> --zone us-central1<span class="hljs-_">-a</span> --tunnel-through-iap --<span class="hljs-built_in">command</span>=<span class="hljs-string">"hostname -I"</span>; 
    i=$((i+<span class="hljs-number">1</span>));
<span class="hljs-keyword">done</span>
</code></pre>
<p>The final part of setting up our simulated physical environment is to create the necessary firewall rules that enable traffic to our control planes and worker nodes, as well as inbound traffic to the load balancer nodes and traffic between clusters:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Add firewall rule to allow traffic to the control plane</span>
gcloud compute firewall-rules create bm-allow-cp \
    --network=<span class="hljs-string">"baremetal"</span> \
    --allow=<span class="hljs-string">"UDP:6081,TCP:22,TCP:6444,TCP:2379-2380,TCP:10250-10252,TCP:4240"</span> \
    --source-ranges=<span class="hljs-string">"10.0.0.0/8"</span> \
    --target-tags=<span class="hljs-string">"cp"</span>

<span class="hljs-comment"># Add firewal rule to allow inbound traffic to worker nodes</span>
gcloud compute firewall-rules create bm-allow-worker \
    --network=<span class="hljs-string">"baremetal"</span> \
    --allow=<span class="hljs-string">"UDP:6081,TCP:22,TCP:10250,TCP:30000-32767,TCP:4240"</span> \
    --source-ranges=<span class="hljs-string">"10.0.0.0/8"</span> \
    --target-tags=<span class="hljs-string">"worker"</span>

<span class="hljs-comment"># Add firewall rule to allow inbound traffic to load balancer nodes</span>
gcloud compute firewall-rules create bm-allow-lb \
    --network=<span class="hljs-string">"baremetal"</span> \
    --allow=<span class="hljs-string">"UDP:6081,TCP:22,TCP:443,TCP:7946,UDP:7496,TCP:4240"</span> \
    --source-ranges=<span class="hljs-string">"10.0.0.0/8"</span> \
    --target-tags=<span class="hljs-string">"lb"</span>

gcloud compute firewall-rules create allow-gfe-to-lb \
    --network=<span class="hljs-string">"baremetal"</span> \
    --allow=<span class="hljs-string">"TCP:443"</span> \
    --source-ranges=<span class="hljs-string">"10.0.0.0/8,130.211.0.0/22,35.191.0.0/16"</span> \
    --target-tags=<span class="hljs-string">"lb"</span>

<span class="hljs-comment"># Add firewall rule to allow traffic between admin and user clusters</span>
gcloud compute firewall-rules create bm-allow-multi \
    --network=<span class="hljs-string">"baremetal"</span> \
    --allow=<span class="hljs-string">"TCP:22,TCP:443"</span> \
    --source-tags=<span class="hljs-string">"admin"</span> \
    --target-tags=<span class="hljs-string">"user"</span>
</code></pre>
<p>We’ve now got a completely simulated physical environment, ready for us to set up.</p>
<h3 id="heading-configuring-the-admin-workstation">Configuring the admin workstation</h3>
<p>The purpose of the admin workstation is to hold the tools and configuration we need to set up and manage our admin and user clusters. We have already created the server itself, but now it’s time to set it up. These commands can be found in the <code>admin-ws-setup.sh</code> script, but if you’re following along, make sure you run all of the commands on <em>the admin workstation itself</em> once you have connected to it.</p>
<p>So first, let’s connect to the workstation using IAP:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">eval</span> `ssh-agent`
ssh-add ~/.ssh/google_compute_engine
gcloud compute ssh --ssh-flag=<span class="hljs-string">"-A"</span> root@admin-workstation \
  --zone us-central1<span class="hljs-_">-a</span> \
  --tunnel-through-iap
</code></pre>
<p>Like I said above, everything else for this section should run on the workstation VM itself (not your local machine, or the Cloud Shell terminal for example). Make sure your prompt looks like this:</p>
<pre><code class="lang-bash">root@admin-workstation:~<span class="hljs-comment">#</span>
</code></pre>
<p>Okay, first we need to remove the preinstalled snap version of the <code>gcloud</code> SDK, and then install the latest version:</p>
<pre><code class="lang-bash">snap remove google-cloud-cli
curl https://sdk.cloud.google.com | bash
<span class="hljs-built_in">exec</span> -l <span class="hljs-variable">$SHELL</span>
</code></pre>
<p>Now we’ll use <code>gcloud</code> to install <code>kubectl</code>:</p>
<pre><code class="lang-bash">gcloud components install kubectl
</code></pre>
<p>We download Google’s <code>bmctl</code> tool, which simplifies cluster management for bare metal servers, and we install it into <code>/usr/local/sbin</code>:</p>
<pre><code class="lang-bash">gsutil cp gs://anthos-baremetal-release/bmctl/1.16.0/linux-amd64/bmctl .
chmod a+x bmctl
mv bmctl /usr/<span class="hljs-built_in">local</span>/sbin/
bmctl version
</code></pre>
<p>We also download Docker and install it. Docker will be used to run local containers as part of the installation process for admin and user clusters:</p>
<pre><code class="lang-bash">curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
docker version
</code></pre>
<p>We need to set up an SSH key so that the admin workstation will be able to connect without passwords to each of our nodes. Once again we’ll leverage bash loops and arrays to copy our SSH public key to each server:</p>
<pre><code class="lang-bash">ssh-keygen -t rsa

<span class="hljs-comment"># Create a bash array containing our server names</span>
<span class="hljs-built_in">declare</span> -a VMs=(<span class="hljs-string">"admin-control"</span> <span class="hljs-string">"user-control"</span> <span class="hljs-string">"user-worker"</span>)

<span class="hljs-comment"># Copy our SSH public key to enable password-less access</span>
<span class="hljs-keyword">for</span> vm <span class="hljs-keyword">in</span> <span class="hljs-string">"<span class="hljs-variable">${VMs[@]}</span>"</span>
<span class="hljs-keyword">do</span>
    ssh-copy-id -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa.pub root@<span class="hljs-variable">$vm</span>
<span class="hljs-keyword">done</span>
</code></pre>
<p>Finally, we install <code>kubectx</code>. This tool allows us to quickly switch contexts between different Kubernetes clusters:</p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/ahmetb/kubectx /opt/kubectx
ln -s /opt/kubectx/kubectx /usr/<span class="hljs-built_in">local</span>/bin/kubectx
ln -s /opt/kubectx/kubens /usr/<span class="hljs-built_in">local</span>/bin/kubens
</code></pre>
<p>That’s it! Our admin workstation is now ready to help us build the rest of our bare metal infrastructure.</p>
<h3 id="heading-creating-the-admin-cluster">Creating the admin cluster</h3>
<p>We will now create the admin cluster. In our demo environment, our admin cluster will contain a single control plane node, but no worker nodes, and this single node will also double up as a load balancer. To create the cluster, we’ll need to set up some services and service accounts, then create a configuration file that our admin workstation will use to configure our admin cluster node and turn it into a functioning Kubernetes cluster (albeit a cluster of one!). These commands can be found in the <code>admin-cluster.sh</code> script.</p>
<p>First, we set some environment variables for our project, zone, SSH key and load balancer IP addresses:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> PROJECT_ID=$(gcloud config get-value project)
<span class="hljs-built_in">export</span> ZONE=us-central1<span class="hljs-_">-a</span>
<span class="hljs-built_in">export</span> SSH_PRIVATE_KEY=/root/.ssh/id_rsa
<span class="hljs-built_in">export</span> LB_CONTROLL_PLANE_NODE=10.200.0.3
<span class="hljs-built_in">export</span> LB_CONTROLL_PLANE_VIP=10.200.0.98
</code></pre>
<p>Next, we create a key for the service account we created earlier, and make sure we’re using that to authenticate with Google Cloud APIs:</p>
<pre><code class="lang-bash">gcloud iam service-accounts keys create installer.json \
  --iam-account=bm-owner@<span class="hljs-variable">$PROJECT_ID</span>.iam.gserviceaccount.com
<span class="hljs-built_in">export</span> GOOGLE_APPLICATION_CREDENTIALS=~/installer.json
</code></pre>
<p>Then we create a config file with the <code>bmctl</code> command. This creates a draft configuration file which we’ll edit in a moment, but it also enables all the APIs we need in our project. The config file is a simple YAML file that tells <code>bmctl</code> how to set up our admin cluster:</p>
<pre><code class="lang-bash">bmctl create config -c admin-cluster \
  --enable-apis \
  --create-service-accounts \
  --project-id=<span class="hljs-variable">$PROJECT_ID</span>
</code></pre>
<p>Take a look at the file that’s been created for yourself at <code>~/bmctl-workspace/admin-cluster/admin-cluster.yaml</code></p>
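<p>If you want a quick look at the fields we’re about to change, a simple <code>grep</code> will surface them (this is just a convenience check; the <code>sed</code> commands below do the actual editing):</p>
<pre><code class="lang-bash"># Show the placeholder lines that the next few sed commands will rewrite
grep -nE "sshPrivateKeyPath|type:|controlPlaneVIP|address:" \
  bmctl-workspace/admin-cluster/admin-cluster.yaml
</code></pre>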
<p>Now we’ll use the power of sed to change a bunch of things in the draft configuration file. First, we’ll replace the SSH key placeholder with our actual SSH key:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|sshPrivateKeyPath: &lt;path to SSH private key, used for node access&gt;|sshPrivateKeyPath: <span class="hljs-subst">$(echo $SSH_PRIVATE_KEY)</span>|g"</span> bmctl-workspace/admin-cluster/admin-cluster.yaml
</code></pre>
<p>Then we’ll change the node type to admin:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|type: hybrid|type: admin|g"</span> bmctl-workspace/admin-cluster/admin-cluster.yaml
</code></pre>
<p>And finally update the IP addresses for the load balancer and control plane:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|- address: &lt;Machine 1 IP&gt;|- address: <span class="hljs-subst">$(echo $LB_CONTROLL_PLANE_NODE)</span>|g"</span> bmctl-workspace/admin-cluster/admin-cluster.yaml
sed -r -i <span class="hljs-string">"s|controlPlaneVIP: 10.0.0.8|controlPlaneVIP: <span class="hljs-subst">$(echo $LB_CONTROLL_PLANE_VIP)</span>|g"</span> bmctl-workspace/admin-cluster/admin-cluster.yaml
</code></pre>
<p>The draft configuration contains a complete <code>NodePool</code> section, but we don’t want that because in the design pattern we’ve chosen, our admin cluster doesn’t have any worker nodes. We can remove this section with the <code>head</code> command:</p>
<pre><code class="lang-bash">head -n -11 bmctl-workspace/admin-cluster/admin-cluster.yaml &gt; temp_file &amp;&amp; mv temp_file bmctl-workspace/admin-cluster/admin-cluster.yaml
</code></pre>
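<p>It’s worth glancing at the end of the file afterwards to confirm the <code>NodePool</code> section really has gone:</p>
<pre><code class="lang-bash"># The tail of the file should now end with the cluster spec, with no NodePool resource
tail -n 20 bmctl-workspace/admin-cluster/admin-cluster.yaml
</code></pre>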
<p>Now we can actually ask <code>bmctl</code> to create the admin cluster (that is, to configure the existing VM):</p>
<pre><code class="lang-bash">bmctl create cluster -c admin-cluster
</code></pre>
<p><code>bmctl</code> will use the config file we have edited, then connect to the control plane node and set it up. Bootstrapping, creating the cluster and performing post-flight checks can take up to 20 minutes, so now is a good time to grab a drink!</p>
<p>When the cluster is finally ready, we’ll add the admin cluster’s kubeconfig to our <code>KUBECONFIG</code> environment variable, register a <code>kubectx</code> context for it, and then run <code>kubectl get nodes</code> just to make sure we can see the control plane node:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> KUBECONFIG=<span class="hljs-variable">$KUBECONFIG</span>:~/bmctl-workspace/admin-cluster/admin-cluster-kubeconfig
kubectx admin=.
kubectl get nodes
</code></pre>
<p>Finally, we create a Kubernetes service account that we’ll use to connect to the cluster from the Cloud Console. The last line in the script prints the token out:</p>
<pre><code class="lang-bash">kubectl create serviceaccount -n kube-system admin-user
kubectl create clusterrolebinding admin-user-binding \
    --clusterrole cluster-admin --serviceaccount kube-system:admin-user
kubectl create token admin-user -n kube-system
</code></pre>
<p>In the <strong>Kubernetes clusters</strong> page of the <strong>GKE</strong> section in the Google Cloud Console, you should now see your “bare metal” admin cluster. Click <strong>Connect</strong> from the three-dot action menu, and select <strong>Token</strong> as the method of authentication. Then paste in the token that was provided. This cluster can now be managed completely by GKE!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727026455000/5f7526e1-2105-4676-ad2b-c688ff9149d9.png" alt class="image--center mx-auto" /></p>
<p>Back on our workstation, we can run <code>kubectl get pods --all-namespaces</code> within the admin cluster context and list all of the different Pods it is running:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727026508600/f0aac485-83e2-44f1-af74-f649e1bdb773.png" alt class="image--center mx-auto" /></p>
<p>As you can see, the admin cluster is doing quite a bit of heavy lifting. In addition to the usual <code>kube-system</code> Pods, we can see Pods for Stackdriver (the original name of Google’s Operations Suite), and several operators for Anthos (the original name of GKE Enterprise). The Anthos Cluster Operator helps to provision and manage other Kubernetes clusters on physical servers, and the Anthos Multinet Controller is an admission controller for these clusters. You’ll also see the GKE Connect Agent running in its own <code>gke-connect</code> namespace, which facilitates communication between your bare metal clusters and your Google Cloud projects.</p>
<h3 id="heading-creating-a-user-cluster">Creating a user cluster</h3>
<p>Now the admin cluster is built, we go through a similar process for the user cluster, which is where our workloads will run. These steps can be found in the <code>user-cluster.sh</code> script, and once again they should be run from the admin workstation.</p>
<p>Just like before, we use <code>bmctl</code> to create a template configuration file:</p>
<pre><code class="lang-bash">bmctl create config -c user-cluster \
  --project-id=<span class="hljs-variable">$PROJECT_ID</span>
</code></pre>
<p>The default file references a credentials file we don’t use, so we’ll get rid of that section, and add the path to our private SSH key instead (once again using <code>sed</code>):</p>
<pre><code class="lang-bash">tail -n +11 bmctl-workspace/user-cluster/user-cluster.yaml &gt; temp_file &amp;&amp; mv temp_file bmctl-workspace/user-cluster/user-cluster.yaml
sed -i <span class="hljs-string">'1 i\sshPrivateKeyPath: /root/.ssh/id_rsa'</span> bmctl-workspace/user-cluster/user-cluster.yaml
</code></pre>
<p>This is a user cluster, so we’ll change the cluster type:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|type: hybrid|type: user|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
</code></pre>
<p>Now we need to set the IP addresses for the control plane node and the API server, as well as the Ingress and LoadBalancer services. Once again <code>sed</code> is our friend:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|- address: &lt;Machine 1 IP&gt;|- address: 10.200.0.4|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|controlPlaneVIP: 10.0.0.8|controlPlaneVIP: 10.200.0.99|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|# ingressVIP: 10.0.0.2|ingressVIP: 10.200.0.100|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|# addressPools:|addressPools:|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|# - name: pool1|- name: pool1|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|#   addresses:|  addresses:|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|#   - 10.0.0.1-10.0.0.4|  - 10.200.0.100-10.200.0.200|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
</code></pre>
<p>We’ll also enable infrastructure and application logging for the cluster:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|# disableCloudAuditLogging: false|disableCloudAuditLogging: false|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|# enableApplication: false|enableApplication: true|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
</code></pre>
<p>If you recall, we didn’t create a worker node for the admin cluster; we just removed the entire <code>NodePool</code> section. Our user cluster, however, will have a node pool, albeit one consisting of just a single additional VM:</p>
<pre><code class="lang-bash">sed -r -i <span class="hljs-string">"s|name: node-pool-1|name: user-cluster-central-pool-1|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|- address: &lt;Machine 2 IP&gt;|- address: 10.200.0.5|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
sed -r -i <span class="hljs-string">"s|- address: &lt;Machine 3 IP&gt;|# - address: &lt;Machine 3 IP&gt;|g"</span> bmctl-workspace/user-cluster/user-cluster.yaml
</code></pre>
<p>Then we use <code>bmctl create cluster</code> to create the cluster using the updated configuration file. Note that we pass in our existing <code>kubeconfig</code> file so that we can reuse our certificate credentials. The user cluster should be slightly quicker to build, but it can still take 10-15 minutes:</p>
<pre><code class="lang-bash">bmctl create cluster -c user-cluster --kubeconfig bmctl-workspace/admin-cluster/admin-cluster-kubeconfig
</code></pre>
<p>When the cluster is up and running, we’ll create a new <code>kubectx</code> context for it, and use <code>kubectl</code> to make sure both nodes in the cluster are running:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> KUBECONFIG=~/bmctl-workspace/user-cluster/user-cluster-kubeconfig
kubectx user=.
kubectl get nodes
</code></pre>
<p>Just like before we will now create a Kubernetes service account that we’ll use in the Cloud Console to log into the cluster:</p>
<pre><code class="lang-bash">kubectl create serviceaccount -n kube-system admin-user
kubectl create clusterrolebinding admin-user-binding \
    --clusterrole cluster-admin --serviceaccount kube-system:admin-user
</code></pre>
<p>And finally, the script will output the token we need to log in via the Console:</p>
<pre><code class="lang-bash">kubectl create token admin-user -n kube-system
</code></pre>
<p>Back in the Cloud Console, you should now see your user cluster. Once again you can click <strong>Connect</strong> from the three-dot action menu, select <strong>Token</strong> as the method of authentication and paste in the token that was provided. You should now see both of your bare metal clusters - but note that only the user cluster shows that it has any available resources for workloads, which makes sense because it’s the only cluster with worker nodes! (okay, 1 node!)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727084369140/06528001-12ba-49c5-9476-b4cd95035348.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-deploying-a-test-workload">Deploying a test workload</h3>
<p>We should now be able to create a test workload and expose it via a load balancer. We’ll use the same demo container as before to create a Hello World deployment. If you’ve been working through the scripts, you should still be logged into the admin workstation and <code>kubectl</code> should be authenticated against your user cluster.</p>
<p>We’ll use <code>kubectl create</code> to create a simple test deployment, followed by <code>kubectl expose</code> to create a service exposed by a load balancer:</p>
<pre><code class="lang-bash">kubectl create deployment hello-server --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
kubectl expose deployment hello-server --<span class="hljs-built_in">type</span> LoadBalancer --port 80 --target-port 8080
</code></pre>
<p>Within a few moments your deployment and service should be up and running. You can get the external IP of the service the usual way:</p>
<pre><code class="lang-bash">kubectl get service hello-server
</code></pre>
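<p>A quick smoke test is to curl the service from one of the nodes, which do sit on the VXLAN overlay. This is just a sketch: replace <code>10.200.0.100</code> with whatever <code>EXTERNAL-IP</code> the previous command reported:</p>
<pre><code class="lang-bash"># Curl the LoadBalancer IP from a node on the overlay network
ssh root@user-worker "curl -s http://10.200.0.100"
</code></pre>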
<p>However, this external IP exists only on the overlay network we created with VXLAN. To route to it from the public Internet for example, we’d also have to simulate some sort of external load balancer, which is outside the scope of this post! This is as far as we can go with simulating bare metal, so let's review what we have built:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727084508647/ce7f643e-dce4-4f23-b4b3-082369465149.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>We first set up the admin workstation, which we used as a base to provide the tools and other configuration artifacts we needed.</p>
</li>
<li><p>Then we deployed the admin cluster. This is a special Kubernetes cluster that acts as a management hub for our environment. Even though we issued the <code>bmctl</code> command from our workstation, the admin cluster was instrumental in the creation of the user cluster. The admin cluster is also responsible for communication with your Google Cloud project.</p>
</li>
<li><p>Next, we created the user cluster. User clusters are more like the traditional Kubernetes clusters that we know and love, as they are the places that deployed workloads will run.</p>
</li>
<li><p>Just like in our previous demonstrations, we deployed a “Hello World” app to test our new environment.</p>
</li>
</ol>
<p>Even though we've been simulating bare metal, this section has hopefully provided a good overview of the requirements for working with bare metal clusters in your own environments. As we’ve already discussed, there are many scenarios where this hybrid approach is appropriate, and the extra complexity involved should be justified by your use case.</p>
<h2 id="heading-building-gke-on-vmware">Building GKE on VMWare</h2>
<p>If you have existing investments in VMWare, such as a large VMWare estate or significant VMWare skills in your team, it can make sense to continue using these instead of or alongside cloud deployments. Just like with bare metal servers, GKE Enterprise can leverage VMWare to deploy fully managed GKE clusters. Your workloads can be deployed to GKE clusters on VMWare, bare metal or in the cloud with a single developer experience and a centralised approach to management.</p>
<p>GKE Enterprise works with VMWare in a very similar way to the bare metal approach we’ve already described and demonstrated, but with an added integration into the infrastructure automation provided by vSphere. At a very high level this means that vSphere can automate some of the infrastructure tasks required, such as creating virtual machines, whereas in bare metal scenarios this is normally done manually.</p>
<p>So just like with bare metal, GKE on VMWare comprises the following components:</p>
<ul>
<li><p><strong>User clusters</strong> to run your containerised workloads. User clusters have one or more control plane nodes and one or more worker nodes, depending on your requirement for high availability.</p>
</li>
<li><p>An <strong>admin cluster</strong> to manage the user clusters. Again, this can be configured in a high availability pattern if your underlying infrastructure offers redundant points of failure.</p>
</li>
<li><p>An <strong>admin workstation</strong> to provide the configuration and tooling for building clusters. Our bare metal clusters used the <code>bmctl</code> tool for configuration and cluster building, but we’ll now use the GKE on VMWare tool called <code>gkectl</code>. This tool uses credentials for vCenter to automate the creation of virtual machines for all clusters, so you don’t have to build them manually.</p>
</li>
</ul>
<p>Let’s walk through a hypothetical build of a GKE environment on VMWare so we can examine some design considerations. Because the overall process is quite similar to GKE on Bare Metal, we won’t be detailing the full installation steps with scripts and commands, we’ll just focus on illustrating what’s different with VMWare. For a full installation guide, see: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/distributed-cloud/vmware/docs/overview">https://cloud.google.com/kubernetes-engine/distributed-cloud/vmware/docs/overview</a></p>
<h3 id="heading-preparing-your-vmware-environment">Preparing your VMWare environment</h3>
<p>At a bare minimum, GKE on VMWare requires at least one physical host running the ESXi hypervisor with 8 CPUs, 80GB of RAM and around half a terabyte of storage. At the time of writing, GKE on VMWare supports ESXi version 7.0u2 or higher and vCenter Server 7.0u2 or higher. Your vSphere environment requires:</p>
<ul>
<li><p>A <strong>vSphere virtual datacenter</strong> – this is essentially a virtual environment that contains all other configuration objects</p>
</li>
<li><p>A <strong>vSphere cluster</strong> – although a single node will work, it still needs to be configured as a cluster</p>
</li>
<li><p>A <strong>vSphere datastore</strong> – this will store the virtual machine files used to provision admin and user clusters for GKE</p>
</li>
<li><p>A <strong>vSphere network</strong> – essentially a virtual network. There are quite a few network requirements for VMWare, so we’ll get into them below.</p>
</li>
</ul>
<p>Just like GKE on Bare Metal, your VMWare environment will require IP address assignments for all nodes, virtual IPs for control plane components, and dedicated CIDR ranges for Pods and Services. It’s important to plan your IP allocations carefully, and how you do this will vary based on your existing VMWare deployment and other network infrastructure. Planning can also include the use of dynamically assigned IP addresses for nodes using DHCP, as vSphere builds the virtual machines for the clusters for you and can request IP addresses as it does so.</p>
<p>Your admin workstation and your clusters are likely to use IP addresses on your vSphere network, and so are the virtual IPs for the API server and cluster ingress. When you choose CIDR ranges for Pods and Services, it's recommended to use private IP ranges as specified in RFC1918. Typically, you spin up more Pods than Services, so for each cluster it’s recommended to create a Pod CIDR range larger than the Service CIDR range. For example, a user cluster could use the <code>192.168.0.0/16</code> block for Pods, and the <code>10.96.0.0/20</code> block for Services.</p>
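<p>As a rough sanity check on sizing, you can count the addresses in each block and confirm that the Pod range really is the larger of the two:</p>
<pre><code class="lang-bash"># Addresses in a /16 (suggested Pod range) vs a /20 (suggested Service range)
echo $(( 2 ** (32 - 16) ))   # 65536
echo $(( 2 ** (32 - 20) ))   # 4096
</code></pre>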
<p>Finally, your clusters will need access to DNS and NTP services, which you may already have running within your VMWare environment.</p>
<h3 id="heading-creating-the-admin-workstation">Creating the admin workstation</h3>
<p>Once again, the admin workstation is the first thing we need before building any clusters. This time we don’t need to build a server (or create a VM) manually, because the tooling will do it for us.</p>
<p>First, we download the <code>gkeadm</code> tool. Instructions for downloading the latest version can be found here: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/distributed-cloud/vmware/docs/how-to/download-gkeadm">https://cloud.google.com/kubernetes-engine/distributed-cloud/vmware/docs/how-to/download-gkeadm</a></p>
<p>Then we’ll need to create a <code>credentials.yaml</code> file that contains our vCenter login information. Here’s an example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">CredentialFile</span>
<span class="hljs-attr">items:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">vCenter</span>
<span class="hljs-attr">username:</span> <span class="hljs-string">myusername</span>
<span class="hljs-attr">password:</span> <span class="hljs-string">mypassword</span>
</code></pre>
<p>Finally, we create a configuration file for the admin workstation. This will contain the details of the vSphere objects we discussed earlier (datacenter, datastore, cluster and network), and the virtual machine specifications used to build the workstation. You can find an example configuration file here: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/distributed-cloud/vmware/docs/how-to/minimal-create-clusters#create_your_admin_workstation_configuration_file">https://cloud.google.com/kubernetes-engine/distributed-cloud/vmware/docs/how-to/minimal-create-clusters#create_your_admin_workstation_configuration_file</a></p>
<p>Once you’ve populated the configuration file with the details of your own vSphere environment, you can run this command:</p>
<pre><code class="lang-bash">gkeadm create admin-workstation --auto-create-service-accounts
</code></pre>
<p>The <code>gkeadm</code> tool then accesses vCenter and builds the admin workstation according to the specified configuration. It also sets up any necessary Google Cloud service accounts and creates some template configuration files for admin and user clusters.</p>
<h3 id="heading-preparing-vsphere-to-build-clusters">Preparing vSphere to build clusters</h3>
<p>Once the admin workstation is built, you can log into it and see that it has created some template cluster configurations, as well as service-account key files that will be used for connecting your clusters back to Google Cloud. The <code>admin-cluster.yaml</code> and <code>user-cluster.yaml</code> files contain the details of your vSphere environment, the sizing for cluster nodes, load balancing information and paths to the service accounts that will be used for GKE Connect as well as logging and monitoring.</p>
<p>Once you have edited or reviewed the configuration files, you’ll use the <code>gkectl prepare</code> command, which will check through the config and import the required images to vSphere, marking them as VM templates. Then you can run <code>gkectl create</code> for each cluster, and all of the required virtual machines and other resources will be built for you in your vSphere environment, thanks to <code>gkectl</code> communicating with your vCenter server. Your clusters are automatically enrolled in your GKE Enterprise fleet thanks to the GKE On-Prem API, which means you can manage them in the Google Cloud console or with gcloud commands just like any other GKE cluster.</p>
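<p>The exact flags vary between releases, so treat the following as a sketch and check the documentation linked above for your version, but from the admin workstation the sequence typically looks something like this:</p>
<pre><code class="lang-bash"># Validate the config and import the required VM images into vSphere
gkectl prepare --config admin-cluster.yaml

# Build the admin cluster, then the user cluster using the admin kubeconfig
gkectl create admin --config admin-cluster.yaml
gkectl create cluster --config user-cluster.yaml --kubeconfig kubeconfig
</code></pre>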
<h2 id="heading-summary">Summary</h2>
<p>In this post, I’ve tried to demonstrate that GKE Enterprise can bring the benefits of a single developer experience and centralised management to Kubernetes clusters running in any environment, not just the cloud. At this point, your GKE fleet might contain clusters running on bare metal, VMWare, Google Cloud and AWS all at the same time, and all managed through a single interface! We also looked at some of the networking and other environmental design considerations required for bare metal and walked through a demonstration of a simulated bare metal environment, once again deploying our trusty “Hello World” application.</p>
<p>Thank you for reading this post! I know it’s a long one, but we’ve still barely scratched the surface of what GKE Enterprise can do. In the next post, we’ll get into some of the fun stuff - automation and configuration management.</p>
]]></content:encoded></item><item><title><![CDATA[Building GKE Clusters in Google Cloud and AWS]]></title><description><![CDATA[This is the second post in a series exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to sup...]]></description><link>https://timberry.dev/building-gke-clusters-in-google-cloud-and-aws</link><guid isPermaLink="true">https://timberry.dev/building-gke-clusters-in-google-cloud-and-aws</guid><category><![CDATA[gke]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[google cloud]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Tue, 10 Sep 2024 10:00:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729603769730/bec960be-9082-4330-a901-edf7aae18657.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the second post in a series exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><strong><em>GKE Enterprise</em></strong></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. If you missed the first post, you might want to</em> <a target="_blank" href="https://timberry.dev/introducing-gke-enterprise"><em>start there</em></a><em>.</em></p>
<p>In this post, I'll begin to introduce the fundamental components of GKE Enterprise and highlight what makes it different from running a standard GKE cluster. To ease us into the hybrid world of GKE Enterprise, we’ll start just by focusing on cloud-based deployments in this post and explore VMWare and bare metal options next time around. This should give us a solid foundation to work from before we explore any further complex topics.</p>
<p>The primary benefit of using GKE Enterprise is the ability it gives you to run Kubernetes clusters in any cloud while maintaining a single developer experience and a centralised approach to management. You may have an organisational need to deploy to multiple clouds, perhaps for additional redundancy and availability purposes, or maybe just so you’re not putting all your eggs in a single vendor’s basket. GKE Enterprise allows you to deploy to Google Cloud, AWS and even Azure with a simplified workflow, so you have complete freedom to choose the target for your workload deployments.</p>
<p>If you want to follow along with anything in this post, you'll need a Google Cloud account (and an AWS account if you follow that part as well). Google used to offer an introductory cloud credit of $300, but that seems to have been changed into an unfathomable number of Gemini token credits, which aren't so useful for building infrastructure. Still, there are free usage quotas for most products. Meanwhile, AWS has a free tier that's equally difficult to decipher. In any case, if you are trying to learn this stuff to help with your job, by all means ask your employer to pay for it. Things can get expensive quickly!</p>
<p><strong>I'm going to repeat this one more time:</strong> 😬 Using any cloud vendor means you are liable to be charged for services – that's how they make their money after all. Even with free trials and special offers, you should expect to pay <em>something</em>, and that something can quickly turn into a <em>big something</em> if you forget to delete resources. Please be careful!</p>
<p>With that disclaimer out of the way, let's dive back into GKE Enterprise.</p>
<h2 id="heading-gke-enterprise-components">GKE Enterprise Components</h2>
<p>Fundamentally, GKE Enterprise is a system for running Kubernetes anywhere but managing it from a single place. Many different components operating at different layers are combined to achieve this, from infrastructure to the network layer to service and policy management. At the lowest level, Kubernetes clusters are used to manage the deployment of workloads, and these can take several forms. In Google Cloud, you obviously deploy GKE clusters. When you use GKE Enterprise to deploy to environments outside of Google Cloud, the service will create GKE-managed clusters which it will operate for you (these were formerly referred to as "GKE On-Prem clusters" even if they ran in other clouds). Finally, you can even choose to bring the management of an existing vanilla Kubernetes cluster into GKE Enterprise.</p>
<p>At the network layer, GKE Enterprise will leverage various connectivity options to communicate with your other environments. You may choose to set up VPN connections for example, or configure a dedicated Interconnect if you expect high traffic between your Google Cloud projects and other locations. GKE Enterprise communicates with clusters in other environments using a component called the Connect Gateway. This gateway provides a communication bridge between a remote cluster’s API and the control plane of your GKE Enterprise environment. Using the gateway allows GKE Enterprise to issue commands and control additional clusters on other clouds or on-premises. We’ll see how this is used to set up clusters in AWS later in this post. Above this layer are tools to control our workloads, such as service mesh, configuration management, and policy control, which we’ll discuss much later on in this series.</p>
<p>As you can imagine, when operating and managing multiple clusters, you will need some additional help to organise all of your resources, which is why GKE Enterprise uses a concept called <strong>fleets</strong>. A <strong>fleet</strong> is simply a logical grouping of Kubernetes clusters, allowing you to manage all clusters in that fleet rather than dealing with them individually. A single fleet can contain clusters that all run within Google Cloud or a combination of clusters running across a variety of environments. Fleets offer many different design patterns for workload, environment, and tenant isolation and also provide other powerful features beyond simple logical cluster groupings. I'll cover these options in more detail later in the series, but for now, you just need to know that all clusters in GKE Enterprise must belong to a fleet.</p>
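<p>You don’t need the console to see fleet membership, either. Once clusters are registered you can list them with <code>gcloud</code>, and the Connect Gateway described above lets <code>kubectl</code> reach any member cluster without direct network access to its API server. A minimal sketch, using a hypothetical membership named <code>my-cluster</code>:</p>
<pre><code class="lang-bash"># List the clusters registered to this project's fleet
gcloud container fleet memberships list

# Fetch kubectl credentials that route through the Connect Gateway
gcloud container fleet memberships get-credentials my-cluster
kubectl get nodes
</code></pre>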
<h2 id="heading-building-your-first-enterprise-cluster">Building your first Enterprise cluster</h2>
<p>With enough of the basic concepts covered, it’s now time to create your first GKE Enterprise cluster. To do this, we’ll first enable GKE Enterprise and set up a fleet associated with our project. Then we’ll create a GKE cluster and register it to that fleet. This post is about deploying GKE in the cloud – not just in Google Cloud, so once our fleet is running, we’ll set up a similar cluster in AWS and observe how GKE’s fleet management allows us to centrally manage both clusters.</p>
<h3 id="heading-enabling-gke-enterprise">Enabling GKE Enterprise</h3>
<p>As I mentioned in the first post, GKE Enterprise is an add-on subscription service. To enable it, you will need a Google Cloud project with a valid billing account, and while it does come with its own free trial, the switch from GKE Standard to GKE Enterprise is likely to take you out of any free tier or free credits you may have. Once GKE Enterprise is enabled, all registered GKE Enterprise clusters will incur per-vCPU charges. Please read the pricing guide for more information at: <a target="_blank" href="https://cloud.google.com/kubernetes-engine/pricing">https://cloud.google.com/kubernetes-engine/pricing</a>.</p>
<p>Have I scared you off yet? 😂 If you're still here, let's do this!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725373158554/7617ca23-045a-47ba-a3ce-22845f99efe9.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>From the Google Cloud console page, select <strong>Kubernetes Engine</strong> in the navigation menu.</p>
</li>
<li><p>In the <strong>Overview</strong> section of Kubernetes Engine, you will see a button that says <strong>Learn About GKE Enterprise</strong>. Click this button.</p>
</li>
<li><p>A pop-over box will appear containing details of the benefits of GKE Enterprise. At the bottom of this box is a button labeled <strong>Enable GKE Enterprise</strong>. You can optionally tick the 90-day free trial option (if it is still on offer when you’re reading this) which should waive the per-vCPU charges for 90 days.</p>
</li>
<li><p>Click the <strong>Enable GKE Enterprise</strong> button. A confirmation pop-over will appear next. This shows you the APIs that will be enabled for your project and it will also tell you the name of the default fleet that will be created for GKE Enterprise. You can edit, create, and delete this and other fleets later.</p>
</li>
<li><p>Click <strong>Confirm</strong> to proceed with enabling GKE Enterprise.</p>
</li>
</ol>
<p>After a few moments, you will see a confirmation message. It will also prompt you to create your first cluster, but we’ll do that separately in a moment. You can now close this pop-over box and return to the Kubernetes Engine section. You should see that you’re back on the overview page with a dashboard showing you the current state of your fleet.</p>
<h3 id="heading-creating-the-cluster">Creating the cluster</h3>
<p>Creating a cluster inside a GKE Enterprise fleet is very similar to how you create a GKE Standard edition cluster, although you’ll see a few extra options are now available to you.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725373259255/8f6ef2d0-235f-46db-8410-266f2a63e0b5.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>From the Kubernetes Engine section, select <strong>Clusters</strong> in the left-hand menu.</p>
</li>
<li><p>Click the <strong>Create</strong> button at the top of the page. The cluster configuration dialog box that pops up contains options for creating standard or autopilot clusters on Google Cloud, as well as GKE clusters on other clouds and even on-premises. At the time of writing, not all of these were supported as options in the UI and some required using the command line instead. For now, we’ll stick with the recommended Autopilot cluster type on Google Cloud. Click <strong>Configure</strong> on this option.</p>
</li>
<li><p>We could click <strong>Create</strong> at this stage and accept all of the defaults, but let’s walk through the setup so we understand what we’re creating: On the first page, <strong>Cluster basics</strong>, give your cluster a name of your choice and choose a region. This could be a region closest to you, or if you prefer you can use the default region which may be cheaper. Then click <strong>Next: Fleet Registration</strong>.</p>
</li>
<li><p>On the next page, you can choose to register your cluster to your fleet. As you can see, there’s an option to skip this, but the whole point of GKE Enterprise is to manage clusters across different environments, so go ahead and check the box. You’ll see a pop-up notifying you that no additional fleet features have been configured so far, but we’ll come back to those in a later post. Now you can click <strong>Next: Networking</strong>.</p>
</li>
<li><p>Just like a GKE Standard cluster, we have several networking options that you are probably already familiar with, such as which network and subnet your GKE nodes should run in, and whether the cluster should run with public or private IP addresses. Let’s leave this at the defaults for now and click <strong>Next: Advanced Settings</strong>.</p>
</li>
<li><p>In the Advanced Settings section, we can choose the release channel, which again is the same principle we would use for a GKE Standard cluster. Let’s stick to the regular channel for now. Notice all the drop-down boxes for the different advanced features. Have a look through and see what GKE Enterprise can offer you, but for now, don’t select anything other than the defaults. We’ll explore a lot of these options later in the blog. Now click <strong>Next: Review and Create</strong>.</p>
</li>
<li><p>Finally, review the options you’ve chosen for your cluster, and click <strong>Create Cluster</strong>.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725373480707/c32ba230-ce23-4e99-ac81-d22b11680952.png" alt class="image--center mx-auto" /></p>
<p>After 5-10 minutes, your GKE Autopilot cluster will be built, ready to use, and registered to your GKE Enterprise fleet.</p>
<h3 id="heading-testing-a-deployment">Testing a Deployment</h3>
<p>Now we'll use the built-in deployment tool in the cloud console to test a deployment to our new cluster. Of course, you’re welcome to use any other deployment method you may be familiar with (for example, command-line <code>kubectl</code>), because this is just a regular Kubernetes cluster behind the scenes, even if Google manages the nodes for us in autopilot mode.</p>
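<p>If you’d rather skip the console, the same result from the command line is roughly this (assuming <code>kubectl</code> is already authenticated against the new cluster):</p>
<pre><code class="lang-bash"># Create the deployment and expose it behind a LoadBalancer service
kubectl create deployment hello-world \
  --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
kubectl expose deployment hello-world --type LoadBalancer --port 80 --target-port 8080
</code></pre>
<p>To do the same thing through the console instead:</p>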
<ol>
<li><p>From the Kubernetes Engine section, select <strong>Workloads</strong> in the left-hand menu.</p>
</li>
<li><p>Click the <strong>Deploy</strong> button at the top of the page. Enter the following container image path as a single line: <code>us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0</code></p>
</li>
<li><p>Click <strong>Continue</strong> then change the deployment name to <code>hello-world</code>. Then click <strong>Continue</strong> again (don’t deploy it just yet!)</p>
</li>
<li><p>In the optional <strong>Expose</strong> section, check the box to also create a matching service for the deployment. In the <strong>Port Mapping</strong> section that appears, make sure you change the <code>Item 1: Port 1</code> port number to <code>8080</code> – this is the port the container listens on. Accept the defaults and click <strong>Deploy</strong>.</p>
</li>
</ol>
<p>After a few minutes, your deployment should be up and running and your service should be exposed. If you see errors at first (about Pods not being able to be scheduled, or the deployment not meeting minimum availability), just wait a few more minutes and refresh the page. Autopilot is scaling up the necessary infrastructure for your workload behind the scenes.</p>
<p>When your workload shows a green tick, you can click the link at the bottom of the page under <strong>Endpoints</strong> to view your new application. In a moment, we’ll repeat these steps for a completely different cloud just to see how easy it is to perform multi-cloud deployments from GKE Enterprise.</p>
<h3 id="heading-automating-with-terraform">Automating with Terraform</h3>
<p>Just like everything else in Google Cloud, we can automate the creation of GKE Enterprise components (and Kubernetes components) too. Let's walk through a basic example of using Terraform to automate everything we’ve just built so far. I don't want to go off on a tangent into how to learn Terraform from scratch, so to follow along below I'll assume that you already have the Terraform command line tools set up and ready to go for your Google Cloud project. You’ll need a service account with the necessary project permissions and a JSON key file which we’ll refer to as <code>serviceaccount.json</code>.</p>
<p>You can also find the code below in my Github repo here: <a target="_blank" href="https://github.com/timhberry/gke-enterprise/tree/main/gke-cluster-tf">https://github.com/timhberry/gke-enterprise/tree/main/gke-cluster-tf</a></p>
<p>First, we’ll create a file called <code>terraform.tfvars</code>. This file will store the name of our project ID and the default Google Cloud region we want to use. Make sure you update these values to match your own project and preferred region:</p>
<pre><code class="lang-json">project_id = <span class="hljs-string">"my-googlecloud-project"</span>
region     = <span class="hljs-string">"us-east1"</span>
</code></pre>
<p>Next, we’ll create a new VPC and subnetwork to use for our cluster. In the example we went through a moment ago, we used the default VPC network that comes preconfigured in every Google Cloud project. However, when using infrastructure as code, it is better to create and manage all resources explicitly rather than to rely on the assumption that they exist and are accessible.</p>
<p>Create the following <code>vpc.tf</code> file:</p>
<pre><code class="lang-json">variable <span class="hljs-string">"project_id"</span> {
  description = <span class="hljs-attr">"project id"</span>
}

variable <span class="hljs-string">"region"</span> {
  description = <span class="hljs-attr">"region"</span>
}

provider <span class="hljs-string">"google"</span> {
  credentials = file(<span class="hljs-attr">"serviceaccount.json"</span>)
  project     = var.project_id
  region      = var.region
}

resource <span class="hljs-string">"google_compute_network"</span> <span class="hljs-string">"gke-vpc"</span> {
  name                    = <span class="hljs-attr">"gke-vpc"</span>
  auto_create_subnetworks = <span class="hljs-attr">"false"</span>
}

resource <span class="hljs-string">"google_compute_subnetwork"</span> <span class="hljs-string">"gke-subnet"</span> {
  name          = <span class="hljs-attr">"gke-subnet"</span>
  region        = var.region
  network       = google_compute_network.gke-vpc.name
  ip_cidr_range = <span class="hljs-attr">"10.10.0.0/24"</span>
}
</code></pre>
<p>Let's walk through this code.</p>
<ul>
<li><p>First, we create two variables that we can use later: <code>project_id</code> and <code>region</code>. This allows us to read the variables from the <code>terraform.tfvars</code> file that we created first. This makes our code somewhat portable and reusable, so it's a good practice to follow.</p>
</li>
<li><p>Next, we set up the Google provider. Here we pass in the <code>serviceaccount.json</code> credentials file that we mentioned earlier and configure our project ID and default region.</p>
</li>
<li><p>Finally, we create two new resources, a <code>google_compute_network</code> (ie., a VPC), and a <code>google_compute_subnetwork</code> inside it. As you can see, we switch off the automatic creation of subnetworks, so this VPC will only contain the one subnetwork that we have chosen to create. This subnetwork gets set up in the region we chose, with an IP address range of <code>10.10.0.0/24</code>.</p>
</li>
</ul>
<p>Now we’ll define the cluster itself. Create the following <code>cluster.tf</code> file:</p>
<pre><code class="lang-json">resource <span class="hljs-string">"google_gke_hub_fleet"</span> <span class="hljs-string">"dev-fleet"</span> {
  display_name = <span class="hljs-attr">"My Dev Fleet"</span>
}

resource <span class="hljs-string">"google_container_cluster"</span> <span class="hljs-string">"gcp-cluster"</span> {
  name             = <span class="hljs-attr">"gcp-cluster"</span>
  location         = var.region
  enable_autopilot = true
  network          = google_compute_network.gke-vpc.name
  subnetwork       = google_compute_subnetwork.gke-subnet.name
}

resource <span class="hljs-string">"google_gke_hub_membership"</span> <span class="hljs-string">"membership"</span> {
  membership_id = <span class="hljs-attr">"basic"</span>
  location      = var.region
  endpoint {
    gke_cluster {
      resource_link = <span class="hljs-attr">"//container.googleapis.com/${google_container_cluster.gcp-cluster.id}"</span>
    }
  }
}

data <span class="hljs-string">"google_client_config"</span> <span class="hljs-string">"provider"</span> {}

data <span class="hljs-string">"google_container_cluster"</span> <span class="hljs-string">"gcp-cluster"</span> {
  name     = <span class="hljs-attr">"gcp-cluster"</span>
  location = var.region
}

provider <span class="hljs-string">"kubernetes"</span> {
  host  = <span class="hljs-attr">"https://${data.google_container_cluster.gcp-cluster.endpoint}"</span>
  token = data.google_client_config.provider.access_token
  cluster_ca_certificate = base64decode(
    data.google_container_cluster.gcp-cluster.master_auth[0].cluster_ca_certificate,
  )
}
</code></pre>
<p>There’s quite a lot to unpack here, so let’s go through the code block by block.</p>
<ul>
<li><p>The first resource we create is a <code>google_gke_hub_fleet</code>, which we call <code>dev-fleet</code>. This is our GKE Enterprise fleet, which as we’ve discussed can manage multiple clusters for us.</p>
</li>
<li><p>Next, we create the <code>google_container_cluster</code>. We give the cluster a name, specify its location and network settings, and make sure we turn on Autopilot mode. There are of course all kinds of other options we could specify here for the configuration of our cluster, but to keep it simple we’re just specifying the required settings, and we’ll accept the defaults for everything else.</p>
</li>
<li><p>Now we have a cluster, we can join it to the fleet. We do this with a <code>google_gke_hub_membership</code> resource. This simply joins the cluster we just created by using Terraform’s own resource link.</p>
</li>
<li><p>Finally, we want to be able to use Terraform to actually configure some Kubernetes objects, not just the cluster and fleet themselves. So we need to configure a Kubernetes provider. We do this by configuring our new cluster as a data source and then referencing it to extract the cluster certificate in the configuration of our Kubernetes provider. Doing this means that Terraform can send authenticated requests directly to the Kubernetes API to create and manage objects for us.</p>
</li>
</ul>
<p>At this point, we have declared our fleet and our cluster, and we’ve configured the Kubernetes provider. Now we can move away from infrastructure and onto our workloads themselves.</p>
<p>The last file we’ll create is a <code>deployment.tf</code> that contains our Kubernetes deployment and service:</p>
<pre><code class="lang-json">resource <span class="hljs-string">"kubernetes_deployment"</span> <span class="hljs-string">"hello-world"</span> {
  metadata {
    name = <span class="hljs-attr">"hello-world"</span>
    labels = {
      app = <span class="hljs-attr">"hello-world"</span>
    }
  }
  spec {
    selector {
      match_labels = {
        app = <span class="hljs-attr">"hello-world"</span>
      }
    }
    template {
      metadata {
        labels = {
          app = <span class="hljs-attr">"hello-world"</span>
        }
      }
      spec {
        container {
          image = <span class="hljs-attr">"us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0"</span>
          name  = <span class="hljs-attr">"hello-world"</span>
        }
      }
    }
  }
}

resource <span class="hljs-string">"kubernetes_service"</span> <span class="hljs-string">"hello-world"</span> {
  metadata {
    name = <span class="hljs-attr">"hello-world"</span>
  }
  spec {
    selector = {
      app = kubernetes_deployment.hello-world.spec.0.template.0.metadata.0.labels.app
    }
    port {
      port        = 80
      target_port = 8080
    }
    type = <span class="hljs-string">"LoadBalancer"</span>
  }
}
</code></pre>
<p>You should already be familiar with these basic Kubernetes building blocks, so we don’t need to examine this code too carefully. As you can probably tell, the configuration is very similar to what we would write if we were just sending a YAML file directly into <code>kubectl</code>. In Terraform of course, we use Hashicorp’s configuration language (HCL) which looks a little bit like JSON, just more verbose. All the usual things are there for a deployment and a service, including the container image name and target port. You can see in the service where we link the selector to the Terraform resource for the deployment.</p>
<p>If your Terraform environment is configured correctly and your service account has the correct permissions, you should now be able to spin up these resources by following the normal Terraform workflow of <code>init</code>, <code>plan</code>, and <code>apply</code>. If you have any issues with your code, you can check it against the GitHub repo I mentioned earlier.</p>
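<p>For reference, from the directory containing these files, that workflow is simply:</p>
<pre><code class="lang-bash">terraform init    # download the Google and Kubernetes providers
terraform plan    # review the resources that will be created
terraform apply   # create the VPC, cluster, fleet membership and workload
</code></pre>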
<p>One final note on fleets – although again, we’ll be exploring these in much more detail later in the series. If you already created a fleet by hand following the steps earlier in this post, and then you tried to create a new fleet in the same project using Terraform, you may have seen an error. That’s because a single project can only contain a single fleet. If you still want to proceed, you can simply remove the <code>google_gke_hub_fleet</code> resource from the Terraform code, so it doesn’t try to create a new fleet. Leave in the <code>google_gke_hub_membership</code> resource for your cluster, and it will join the existing fleet in the project.</p>
<p>Setting up our first cluster inside Google Cloud probably seemed quite easy, and that’s because it is most definitely designed that way. Next, we’ll look at how we can set up GKE Enterprise to build a cluster in AWS, which will involve laying down a few more foundations first. As you read through these sections you may find that the steps involved are quite long and complicated, but bear in mind that much of the groundwork only needs to be done once, and of course, it too can be automated by tools like Terraform.</p>
<h2 id="heading-building-clusters-on-aws">Building clusters on AWS</h2>
<p>Getting GKE Enterprise ready to deploy into AWS for you requires many different moving parts. At a high level, we need to perform the following steps:</p>
<ol>
<li><p>Create an AWS VPC with the correct subnets, routes and gateways.</p>
</li>
<li><p>Configure keys for the encryption of EC2 instance data, EBS volumes in AWS and etcd state data in our clusters.</p>
</li>
<li><p>Create IAM roles and SSH key pairs.</p>
</li>
</ol>
<p>GKE runs within its own VPC inside your AWS account, which will include private and public subnets. The private subnets will host the EC2 virtual machines that provide control plane nodes as well as worker nodes (organised into node pools) for your cluster. An internal network load balancer will provide the front end for the control plane. Finally, NAT gateways will operate in the public subnets to provide connectivity for the private subnets.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725375136198/1df2d405-fc3e-46e6-ba4c-e175510f2b46.png" alt class="image--center mx-auto" /></p>
<p>The Anthos Multi-Cloud API (as it was still named at the time of writing) communicates with the AWS API to manage AWS resources, while the Connect API communicates with the Kubernetes API of the cluster running in AWS. This allows you to create and manage your AWS clusters entirely from within the Google Cloud console, provided the foundations have been configured first.</p>
<p>There is <em>a lot</em> of groundwork to cover in setting up AWS for GKE. Many complex commands are required over hundreds and hundreds of lines. You definitely want to automate this! I'll walk through everything below, but I've also provided the commands in a set of scripts in my GitHub repo here: <a target="_blank" href="https://github.com/timhberry/gke-enterprise/tree/main/aws-cluster-scripts">https://github.com/timhberry/gke-enterprise/tree/main/aws-cluster-scripts</a></p>
<p>For reference, the scripts are:</p>
<ul>
<li><p><code>aws-keys.sh</code> - Creating KMS and SSH key pairs</p>
</li>
<li><p><code>aws-iam.sh</code> - Creating the necessary IAM roles, policies and profiles</p>
</li>
<li><p><code>aws-vpc.sh</code> - Creating the VPC network, subnets, routes and gateways</p>
</li>
<li><p><code>aws-cluster.sh</code> - Finally creating the cluster and node pool</p>
</li>
</ul>
<p>The commands in these scripts rely on you having the <code>gcloud</code> and <code>aws</code> command-line tools installed and configured, and you'll also need the <code>jq</code> JSON tool. I've only tested these commands on Linux and macOS; if you are using Windows, it’s recommended to use the Windows Subsystem for Linux to provide a proper Linux shell experience. Alternatively, you can use the Cloud Shell terminal in the Google Cloud console. Note that the scripts use the <code>us-east-1</code> region, but you can change that if you prefer, and I have a convention of prefixing names with <code>gke-cluster</code>, which you can also change if you like.</p>
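<p>Before diving in, it's worth sanity-checking that the tools are installed and authenticated against the right accounts. Something along these lines should do it (just a quick sketch, adjust as you see fit):</p>
<pre><code class="lang-bash"># Check the CLIs are installed and on your path
gcloud --version
aws --version
jq --version

# Check which identities the CLIs are using
gcloud config list account --format "value(core.account)"
aws sts get-caller-identity
</code></pre>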
<p>Oh, and don't just run the scripts as-is! They're supposed to helpfully compile the commands together, but you should definitely run each command one at a time to help you spot any errors. Some commands create environment variables that are used in later commands. So much preamble!</p>
<h3 id="heading-encryption-keys-first">Encryption keys first</h3>
<p>GKE Enterprise uses the AWS Key Management Service (KMS) to create symmetric encryption keys. These are then used for encryption of Kubernetes state data in etcd, EC2 instance user data, and at-rest encryption of data on EBS volumes including control plane and node pool data.</p>
<p>In a production environment, you should use different keys for different components, for example by encrypting configuration and volumes separately. It’s also recommended to further secure the key policy by using a minimum set of KMS permissions. However, this is not an AWS Security post, so for now, we’ll be creating a single KMS key for our purposes.</p>
<p><em>Side note:</em> everything you create in AWS gets a unique resource name, or ARN, and often we need to reference the ARNs of resources we've already created. When we need to do this, we'll pipe the output to <code>jq</code> and store the result in an environment variable.</p>
<pre><code class="lang-bash">KMS_KEY_ARN=$(aws --region us-east-1 kms create-key \
    --description <span class="hljs-string">"gke-key"</span> \
    --output json| jq -r <span class="hljs-string">'.KeyMetadata.Arn'</span>)
</code></pre>
<p>Next, we'll create an EC2 SSH key pair for our EC2 instances, just in case we ever need to connect to them for troubleshooting purposes:</p>
<pre><code class="lang-bash">ssh-keygen -t rsa -m PEM -b 4096 -C <span class="hljs-string">"GKE key pair"</span> \
      -f gke-key -N <span class="hljs-string">""</span> 1&gt;/dev/null
aws ec2 import-key-pair --key-name gke-key \
      --public-key-material fileb://gke-key.pub
</code></pre>
<h3 id="heading-iam-permissions">IAM permissions</h3>
<p>Several sets of IAM permissions will need to be set up in your AWS account for GKE Enterprise to work. First, let's set some environment variables that reference the Google Cloud project where our fleet exists:</p>
<pre><code class="lang-bash">PROJECT_ID=<span class="hljs-string">"<span class="hljs-subst">$(gcloud config get-value project)</span>"</span>
PROJECT_NUMBER=$(gcloud projects describe <span class="hljs-string">"<span class="hljs-variable">$PROJECT_ID</span>"</span> \
    --format <span class="hljs-string">"value(projectNumber)"</span>)
</code></pre>
<p>We'll now create an AWS IAM role for the GKE Multi-Cloud API. Remember, this is the API that will communicate directly with the AWS API when we ask GKE to run commands or control components within AWS:</p>
<pre><code class="lang-bash">aws iam create-role --role-name gke-api-role \
    --assume-role-policy-document <span class="hljs-string">'{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
            "Federated": "accounts.google.com"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
            "accounts.google.com:sub": "service-'</span><span class="hljs-variable">$PROJECT_NUMBER</span><span class="hljs-string">'@gcp-sa-gkemulticloud.iam.gserviceaccount.com"
            }
      }
    }
  ]
}'</span>
</code></pre>
<p>Almost forgot! We actually need the ARN for this role, so let's grab it:</p>
<pre><code class="lang-bash">API_ROLE_ARN=$(aws iam list-roles \
  --query <span class="hljs-string">'Roles[?RoleName==`gke-api-role`].Arn'</span> \
  --output text)
</code></pre>
<p>Now, the GKE IAM role will need lots of different permissions, so we'll create a new AWS IAM policy for this:</p>
<pre><code class="lang-bash">API_POLICY_ARN=$(aws iam create-policy --policy-name gke-api-policy \
  --policy-document <span class="hljs-string">'{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CreateLaunchTemplate",
        "ec2:CreateNetworkInterface",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:DeleteLaunchTemplate",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DescribeAccountAttributes",
        "ec2:DescribeInstances",
        "ec2:DescribeInternetGateways",
        "ec2:DescribeKeyPairs",
        "ec2:DescribeLaunchTemplates",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeSecurityGroupRules",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcs",
        "ec2:GetConsoleOutput",
        "ec2:ModifyInstanceAttribute",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "iam:AWSServiceName",
        "iam:CreateServiceLinkedRole",
        "iam:GetInstanceProfile",
        "iam:PassRole",
        "autoscaling:CreateAutoScalingGroup",
        "autoscaling:CreateOrUpdateTags",
        "autoscaling:DeleteAutoScalingGroup",
        "autoscaling:DeleteTags",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DisableMetricsCollection",
        "autoscaling:EnableMetricsCollection",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "autoscaling:UpdateAutoScalingGroup",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:CreateListener",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:CreateTargetGroup",
        "elasticloadbalancing:DeleteListener",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:DeleteTargetGroup",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticloadbalancing:ModifyTargetGroupAttributes",
        "elasticloadbalancing:RemoveTags",
        "kms:DescribeKey",
        "kms:Encrypt",
        "kms:GenerateDataKeyWithoutPlaintext"
      ],
      "Resource": "*"
    }
  ]
}'</span> --output json | jq -r <span class="hljs-string">".Policy.Arn"</span>)
</code></pre>
<p>Now we'll attach the policy to the role:</p>
<pre><code class="lang-bash">aws iam attach-role-policy \
    --policy-arn <span class="hljs-variable">$API_POLICY_ARN</span> \
    --role-name gke-api-role
</code></pre>
<p>Next, we create another IAM role, this time for the nodes that will make up our control plane:</p>
<pre><code class="lang-bash">aws iam create-role --role-name control-plane-role \
    --assume-role-policy-document <span class="hljs-string">'{
    "Version": "2012-10-17",
    "Statement": [
    {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
             "Service": "ec2.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }
  ]
}'</span>
</code></pre>
<p>We'll also need to create a new policy for this role:</p>
<pre><code class="lang-bash">CONTROL_PLANE_POLICY_ARN=$(aws iam create-policy --policy-name control-plane-policy \
  --policy-document <span class="hljs-string">'{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "ec2:AttachNetworkInterface",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CreateRoute",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:DeleteRoute",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSnapshot",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DescribeAccountAttributes",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInternetGateways",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeRegions",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSnapshots",
        "ec2:DescribeSubnets",
        "ec2:DescribeTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeVolumesModifications",
        "ec2:DescribeVpcs",
        "ec2:DetachVolume",
        "ec2:ModifyInstanceAttribute",
        "ec2:ModifyVolume",
        "ec2:RevokeSecurityGroupIngress",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
        "elasticloadbalancing:AttachLoadBalancerToSubnets",
        "elasticloadbalancing:ConfigureHealthCheck",
        "elasticloadbalancing:CreateListener",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:CreateLoadBalancerListeners",
        "elasticloadbalancing:CreateLoadBalancerPolicy",
        "elasticloadbalancing:CreateTargetGroup",
        "elasticloadbalancing:DeleteListener",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:DeleteLoadBalancerListeners",
        "elasticloadbalancing:DeleteTargetGroup",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancerAttributes",
        "elasticloadbalancing:DescribeLoadBalancerPolicies",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticloadbalancing:DetachLoadBalancerFromSubnets",
        "elasticloadbalancing:ModifyListener",
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:ModifyTargetGroup",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
        "elasticloadbalancing:SetLoadBalancerPoliciesOfListener",
        "elasticfilesystem:CreateAccessPoint",
        "elasticfilesystem:DeleteAccessPoint",
        "elasticfilesystem:DescribeAccessPoints",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeMountTargets",
        "kms:CreateGrant",
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GrantIsForAWSResource"
      ],
      "Resource": "*"
    }
  ]
}'</span> --output json | jq -r <span class="hljs-string">".Policy.Arn"</span>)
</code></pre>
<p>And then attach that policy to the role:</p>
<pre><code class="lang-bash">aws iam attach-role-policy \
    --policy-arn <span class="hljs-variable">$CONTROL_PLANE_POLICY_ARN</span> \
    --role-name control-plane-role
</code></pre>
<p>Now, because we're using this role and policy with EC2 instances, we'll also create an instance profile, and add the role to it:</p>
<pre><code class="lang-bash">CONTROL_PLANE_PROFILE=control-plane-profile
aws iam create-instance-profile \
    --instance-profile-name <span class="hljs-variable">$CONTROL_PLANE_PROFILE</span>
aws iam add-role-to-instance-profile \
    --instance-profile-name <span class="hljs-variable">$CONTROL_PLANE_PROFILE</span> \
    --role-name control-plane-role
</code></pre>
<p>Finally, we'll go through a similar process for the node pools: creating an IAM role and a policy, attaching the policy to the role, creating an instance profile, and then adding the role to the instance profile:</p>
<pre><code class="lang-bash">aws iam create-role --role-name node-pool-role \
    --assume-role-policy-document <span class="hljs-string">'{
    "Version": "2012-10-17",
    "Statement": [
    {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
        "Service": "ec2.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }
  ]
}'</span>

NODE_POOL_POLICY_ARN=$(aws iam create-policy --policy-name node-pool-policy_kms \
  --policy-document <span class="hljs-string">'{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "'</span><span class="hljs-variable">$KMS_KEY_ARN</span><span class="hljs-string">'"
    }
  ]
}'</span> --output json | jq -r <span class="hljs-string">".Policy.Arn"</span>)

aws iam attach-role-policy --role-name node-pool-role \
    --policy-arn <span class="hljs-variable">$NODE_POOL_POLICY_ARN</span>

NODE_POOL_PROFILE=node-pool-profile
aws iam create-instance-profile \
    --instance-profile-name <span class="hljs-variable">$NODE_POOL_PROFILE</span>

aws iam add-role-to-instance-profile \
    --instance-profile-name <span class="hljs-variable">$NODE_POOL_PROFILE</span> \
    --role-name node-pool-role
</code></pre>
<p>Now we finally have our encryption keys, IAM profiles and roles set up, and we can move onto networking!</p>
<h3 id="heading-aws-networking">AWS Networking</h3>
<p>Our GKE clusters will require a dedicated VPC, subnets, internet gateways, routing tables, elastic IPs and NAT gateways to work. There's a lot to get through, so let's get started!</p>
<p>First, the easy bit. We'll create a new VPC with a CIDR of <code>10.0.0.0/16</code>. Feel free to adjust the region to suit your requirements:</p>
<pre><code class="lang-bash">aws --region us-east-1 ec2 create-vpc \
    --cidr-block 10.0.0.0/16 \
    --tag-specifications <span class="hljs-string">'ResourceType=vpc, Tags=[{Key=Name,Value=gke-cluster-VPC}]'</span>
</code></pre>
<p>Next, we'll capture the ID of the VPC, and use it to enable support for DNS and hostnames:</p>
<pre><code class="lang-bash">VPC_ID=$(aws ec2 describe-vpcs \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-VPC'</span> \
  --query <span class="hljs-string">"Vpcs[].VpcId"</span> --output text)
aws ec2 modify-vpc-attribute --enable-dns-hostnames --vpc-id <span class="hljs-variable">$VPC_ID</span>
aws ec2 modify-vpc-attribute --enable-dns-support --vpc-id <span class="hljs-variable">$VPC_ID</span>
</code></pre>
<p>The next step is to create private subnets in our VPC for our control plane nodes. By default, a GKE cluster in AWS is the equivalent of a private cluster, with no direct access to the API server permitted over the internet. We will instead connect to it via GKE’s Connect API, as we’ll see later on. You can create fewer than three subnets in your VPC if you wish, but three are recommended as GKE will create three control plane nodes regardless. Spreading them across three different availability zones gives you better redundancy in the event of any zonal outages.</p>
<p>In this example I'm using <code>us-east-1a</code>, <code>us-east-1b</code> and <code>us-east-1c</code> availability zones and assigning each of them a <code>/24</code> CIDR block. If you’re using a different region you’ll want to change these availability zones accordingly.</p>
<pre><code class="lang-bash">aws ec2 create-subnet \
  --availability-zone us-east-1a \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.1.0/24 \
  --tag-specifications <span class="hljs-string">'ResourceType=subnet, Tags=[{Key=Name,Value=gke-cluster-PrivateSubnet1}]'</span>
aws ec2 create-subnet \
  --availability-zone us-east-1b \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.2.0/24 \
  --tag-specifications <span class="hljs-string">'ResourceType=subnet, Tags=[{Key=Name,Value=gke-cluster-PrivateSubnet2}]'</span>
aws ec2 create-subnet \
  --availability-zone us-east-1c \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.3.0/24 \
  --tag-specifications <span class="hljs-string">'ResourceType=subnet, Tags=[{Key=Name,Value=gke-cluster-PrivateSubnet3}]'</span>
</code></pre>
<p>Next we need to create three public subnets. These will be used to provide outbound internet access for the private subnets (once again, you may need to change the availability zones if you’re using a different region).</p>
<pre><code class="lang-bash">aws ec2 create-subnet \
  --availability-zone us-east-1a \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.101.0/24 \
  --tag-specifications <span class="hljs-string">'ResourceType=subnet, Tags=[{Key=Name,Value=gke-cluster-PublicSubnet1}]'</span>
aws ec2 create-subnet \
  --availability-zone us-east-1b \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.102.0/24 \
  --tag-specifications <span class="hljs-string">'ResourceType=subnet, Tags=[{Key=Name,Value=gke-cluster-PublicSubnet2}]'</span>
aws ec2 create-subnet \
  --availability-zone us-east-1c \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.103.0/24 \
  --tag-specifications <span class="hljs-string">'ResourceType=subnet, Tags=[{Key=Name,Value=gke-cluster-PublicSubnet3}]'</span>

PUBLIC_SUBNET_ID_1=$(aws ec2 describe-subnets \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PublicSubnet1'</span> \
  --query <span class="hljs-string">"Subnets[].SubnetId"</span> --output text)
PUBLIC_SUBNET_ID_2=$(aws ec2 describe-subnets \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PublicSubnet2'</span> \
  --query <span class="hljs-string">"Subnets[].SubnetId"</span> --output text)
PUBLIC_SUBNET_ID_3=$(aws ec2 describe-subnets \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PublicSubnet3'</span> \
  --query <span class="hljs-string">"Subnets[].SubnetId"</span> --output text)
</code></pre>
<p>At this point we’ve just created the subnets, but we haven’t made them public. To do this we need to modify the <code>map-public-ip-on-launch</code> attribute of each subnet:</p>
<pre><code class="lang-bash">aws ec2 modify-subnet-attribute \
  --map-public-ip-on-launch \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_1</span>
aws ec2 modify-subnet-attribute \
  --map-public-ip-on-launch \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_2</span>
aws ec2 modify-subnet-attribute \
  --map-public-ip-on-launch \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_3</span>
</code></pre>
<p>Next step is to create an internet gateway for the VPC:</p>
<pre><code class="lang-bash">aws --region us-east-1  ec2 create-internet-gateway \
  --tag-specifications <span class="hljs-string">'ResourceType=internet-gateway, Tags=[{Key=Name,Value=gke-cluster-InternetGateway}]'</span>
INTERNET_GW_ID=$(aws ec2 describe-internet-gateways \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-InternetGateway'</span> \
  --query <span class="hljs-string">"InternetGateways[].InternetGatewayId"</span> --output text)
aws ec2 attach-internet-gateway \
  --internet-gateway-id <span class="hljs-variable">$INTERNET_GW_ID</span> \
  --vpc-id <span class="hljs-variable">$VPC_ID</span>
</code></pre>
<p>You're probably wondering at this point why AWS networking is so much harder than Google Cloud! Well guess what, we're not even close to being done 😂</p>
<p>We now have a ton of groundwork to lay in the form of route tables. We start with our new public subnets, creating a route table for each and then associating it with the corresponding subnet:</p>
<pre><code class="lang-bash">aws ec2 create-route-table --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=route-table, Tags=[{Key=Name,Value=gke-cluster-PublicRouteTbl1}]'</span>
aws ec2 create-route-table --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=route-table, Tags=[{Key=Name,Value=gke-cluster-PublicRouteTbl2}]'</span>
aws ec2 create-route-table --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=route-table, Tags=[{Key=Name,Value=gke-cluster-PublicRouteTbl3}]'</span>

PUBLIC_ROUTE_TABLE_ID_1=$(aws ec2 describe-route-tables \
    --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PublicRouteTbl1'</span> \
    --query <span class="hljs-string">"RouteTables[].RouteTableId"</span> --output text)
PUBLIC_ROUTE_TABLE_ID_2=$(aws ec2 describe-route-tables \
    --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PublicRouteTbl2'</span> \
    --query <span class="hljs-string">"RouteTables[].RouteTableId"</span> --output text)
PUBLIC_ROUTE_TABLE_ID_3=$(aws ec2 describe-route-tables \
    --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PublicRouteTbl3'</span> \
    --query <span class="hljs-string">"RouteTables[].RouteTableId"</span> --output text)

aws ec2 associate-route-table \
  --route-table-id <span class="hljs-variable">$PUBLIC_ROUTE_TABLE_ID_1</span> \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_1</span>
aws ec2 associate-route-table \
  --route-table-id <span class="hljs-variable">$PUBLIC_ROUTE_TABLE_ID_2</span> \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_2</span>
aws ec2 associate-route-table \
  --route-table-id <span class="hljs-variable">$PUBLIC_ROUTE_TABLE_ID_3</span> \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_3</span>
</code></pre>
<p>Public subnets also need default routes for the internet gateway:</p>
<pre><code class="lang-bash">aws ec2 create-route --route-table-id <span class="hljs-variable">$PUBLIC_ROUTE_TABLE_ID_1</span> \
  --destination-cidr-block 0.0.0.0/0 --gateway-id <span class="hljs-variable">$INTERNET_GW_ID</span>
aws ec2 create-route --route-table-id <span class="hljs-variable">$PUBLIC_ROUTE_TABLE_ID_2</span> \
  --destination-cidr-block 0.0.0.0/0 --gateway-id <span class="hljs-variable">$INTERNET_GW_ID</span>
aws ec2 create-route --route-table-id <span class="hljs-variable">$PUBLIC_ROUTE_TABLE_ID_3</span> \
  --destination-cidr-block 0.0.0.0/0 --gateway-id <span class="hljs-variable">$INTERNET_GW_ID</span>
</code></pre>
<p>The final component for the public subnets is a NAT gateway in each one, which will handle network address translation for outbound traffic from the private subnets. These gateways require public IP addresses, so first we assign some elastic IPs, one for each gateway, then we create the NAT gateway for each public subnet:</p>
<pre><code class="lang-bash">aws ec2 allocate-address \
  --tag-specifications <span class="hljs-string">'ResourceType=elastic-ip, Tags=[{Key=Name,Value=gke-cluster-NatEip1}]'</span>
aws ec2 allocate-address \
  --tag-specifications <span class="hljs-string">'ResourceType=elastic-ip, Tags=[{Key=Name,Value=gke-cluster-NatEip2}]'</span>
aws ec2 allocate-address \
  --tag-specifications <span class="hljs-string">'ResourceType=elastic-ip, Tags=[{Key=Name,Value=gke-cluster-NatEip3}]'</span>

NAT_EIP_ALLOCATION_ID_1=$(aws ec2 describe-addresses \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-NatEip1'</span> \
  --query <span class="hljs-string">"Addresses[].AllocationId"</span> --output text)
NAT_EIP_ALLOCATION_ID_2=$(aws ec2 describe-addresses \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-NatEip2'</span> \
  --query <span class="hljs-string">"Addresses[].AllocationId"</span> --output text)
NAT_EIP_ALLOCATION_ID_3=$(aws ec2 describe-addresses \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-NatEip3'</span> \
  --query <span class="hljs-string">"Addresses[].AllocationId"</span> --output text)

aws ec2 create-nat-gateway \
  --allocation-id <span class="hljs-variable">$NAT_EIP_ALLOCATION_ID_1</span> \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_1</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=natgateway, Tags=[{Key=Name,Value=gke-cluster-NatGateway1}]'</span>
aws ec2 create-nat-gateway \
  --allocation-id <span class="hljs-variable">$NAT_EIP_ALLOCATION_ID_2</span> \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_2</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=natgateway, Tags=[{Key=Name,Value=gke-cluster-NatGateway2}]'</span>
aws ec2 create-nat-gateway \
  --allocation-id <span class="hljs-variable">$NAT_EIP_ALLOCATION_ID_3</span> \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_ID_3</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=natgateway, Tags=[{Key=Name,Value=gke-cluster-NatGateway3}]'</span>
</code></pre>
<p>Moving on to the private subnets, each of these will also need a route table. We'll create them, assign their IDs to environment variables, and associate each table with its corresponding subnet:</p>
<pre><code class="lang-bash">aws ec2 create-route-table --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=route-table, Tags=[{Key=Name,Value=gke-cluster-PrivateRouteTbl1}]'</span>
aws ec2 create-route-table --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=route-table, Tags=[{Key=Name,Value=gke-cluster-PrivateRouteTbl2}]'</span>
aws ec2 create-route-table --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --tag-specifications <span class="hljs-string">'ResourceType=route-table, Tags=[{Key=Name,Value=gke-cluster-PrivateRouteTbl3}]'</span>

PRIVATE_SUBNET_ID_1=$(aws ec2 describe-subnets \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PrivateSubnet1'</span> \
  --query <span class="hljs-string">"Subnets[].SubnetId"</span> --output text)
PRIVATE_SUBNET_ID_2=$(aws ec2 describe-subnets \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PrivateSubnet2'</span> \
  --query <span class="hljs-string">"Subnets[].SubnetId"</span> --output text)
PRIVATE_SUBNET_ID_3=$(aws ec2 describe-subnets \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PrivateSubnet3'</span> \
  --query <span class="hljs-string">"Subnets[].SubnetId"</span> --output text)
PRIVATE_ROUTE_TABLE_ID_1=$(aws ec2 describe-route-tables \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PrivateRouteTbl1'</span> \
  --query <span class="hljs-string">"RouteTables[].RouteTableId"</span> --output text)
PRIVATE_ROUTE_TABLE_ID_2=$(aws ec2 describe-route-tables \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PrivateRouteTbl2'</span> \
  --query <span class="hljs-string">"RouteTables[].RouteTableId"</span> --output text)
PRIVATE_ROUTE_TABLE_ID_3=$(aws ec2 describe-route-tables \
  --filters <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-PrivateRouteTbl3'</span> \
  --query <span class="hljs-string">"RouteTables[].RouteTableId"</span> --output text)

aws ec2 associate-route-table --route-table-id <span class="hljs-variable">$PRIVATE_ROUTE_TABLE_ID_1</span> \
  --subnet-id <span class="hljs-variable">$PRIVATE_SUBNET_ID_1</span>
aws ec2 associate-route-table --route-table-id <span class="hljs-variable">$PRIVATE_ROUTE_TABLE_ID_2</span> \
  --subnet-id <span class="hljs-variable">$PRIVATE_SUBNET_ID_2</span>
aws ec2 associate-route-table --route-table-id <span class="hljs-variable">$PRIVATE_ROUTE_TABLE_ID_3</span> \
  --subnet-id <span class="hljs-variable">$PRIVATE_SUBNET_ID_3</span>
</code></pre>
<p>We're so close! The final stage is to create default routes to the NAT gateways for our private subnets:</p>
<pre><code class="lang-bash">NAT_GW_ID_1=$(aws ec2 describe-nat-gateways \
 --filter <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-NatGateway1'</span> \
 --query <span class="hljs-string">"NatGateways[].NatGatewayId"</span> --output text)
NAT_GW_ID_2=$(aws ec2 describe-nat-gateways \
 --filter <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-NatGateway2'</span> \
 --query <span class="hljs-string">"NatGateways[].NatGatewayId"</span> --output text)
NAT_GW_ID_3=$(aws ec2 describe-nat-gateways \
 --filter <span class="hljs-string">'Name=tag:Name,Values=gke-cluster-NatGateway3'</span> \
 --query <span class="hljs-string">"NatGateways[].NatGatewayId"</span> --output text)

aws ec2 create-route --route-table-id <span class="hljs-variable">$PRIVATE_ROUTE_TABLE_ID_1</span>  \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <span class="hljs-variable">$NAT_GW_ID_1</span>
aws ec2 create-route --route-table-id <span class="hljs-variable">$PRIVATE_ROUTE_TABLE_ID_2</span>  \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <span class="hljs-variable">$NAT_GW_ID_2</span>
aws ec2 create-route --route-table-id <span class="hljs-variable">$PRIVATE_ROUTE_TABLE_ID_3</span> \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id <span class="hljs-variable">$NAT_GW_ID_3</span>
</code></pre>
<p>Again, hats off to you if you work with AWS networking on a daily basis. I hope you get plenty of paid leave! But seriously, even though there's a ton of foundations to lay here, hopefully you only have to do it once.</p>
<h3 id="heading-actually-finally-building-a-cluster-in-aws">Actually, finally, building a cluster in AWS</h3>
<p>This is the moment we've been waiting for! Just note that the commands below will reference a lot of the environment variables that were created in the stages above, so make sure these still exist in your terminal session before you proceed. Building the cluster is a two-step process: control plane first, then worker nodes.</p>
<p>Let's start with the cluster and its control plane:</p>
<pre><code class="lang-bash">gcloud container aws clusters create aws-cluster \
  --cluster-version 1.26.2-gke.1001 \
  --aws-region us-east-1 \
  --location us-east4 \
  --fleet-project <span class="hljs-variable">$PROJECT_ID</span> \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --subnet-ids <span class="hljs-variable">$PRIVATE_SUBNET_ID_1</span>,<span class="hljs-variable">$PRIVATE_SUBNET_ID_2</span>,<span class="hljs-variable">$PRIVATE_SUBNET_ID_3</span> \
  --pod-address-cidr-blocks 10.2.0.0/16 \
  --service-address-cidr-blocks 10.1.0.0/16 \
  --role-arn <span class="hljs-variable">$API_ROLE_ARN</span> \
  --iam-instance-profile <span class="hljs-variable">$CONTROL_PLANE_PROFILE</span> \
  --database-encryption-kms-key-arn <span class="hljs-variable">$KMS_KEY_ARN</span> \
  --config-encryption-kms-key-arn <span class="hljs-variable">$KMS_KEY_ARN</span> \
  --tags google:gkemulticloud:cluster=aws-cluster
</code></pre>
<p>As you can see, there are many different options that we pass into the command. Notably, we can specify a GKE version (<code>1.26.2-gke.1001</code> in this example). We also need to provide the AWS region to build the cluster in with the <code>--aws-region</code> option, as well as the Google Cloud region where the Connect API should run. This isn’t available in all Google Cloud locations, so in this example, <code>us-east4</code> is the closest location to our AWS region where the Connect API is available.</p>
<p>GKE will leverage the Multi-cloud API along with the IAM and KMS configuration you have set up in AWS to connect to the AWS API and build the control plane of your cluster. This should take about 5 minutes, after which your cluster should be shown in the GKE Enterprise cluster list in the Google Cloud console. Note that at this point you may see a warning next to the cluster – don't worry, we’ll fix that in a moment.</p>
<p>Next, we build a node pool for our new cluster – this doesn’t happen automatically like when we build a cluster on Google Cloud. To do this we specify a minimum and a maximum number of nodes along with other options such as the root volume size for each virtual machine, and the subnet ID where they should be hosted:</p>
<pre><code class="lang-bash">gcloud container aws node-pools create pool-0 \
  --location us-east4 \
  --cluster aws-cluster \
  --node-version 1.26.2-gke.1001 \
  --min-nodes 1 \
  --max-nodes 5 \
  --max-pods-per-node 110 \
  --root-volume-size 50 \
  --subnet-id <span class="hljs-variable">$PRIVATE_SUBNET_ID_1</span> \
  --iam-instance-profile <span class="hljs-variable">$NODE_POOL_PROFILE</span> \
  --config-encryption-kms-key-arn <span class="hljs-variable">$KMS_KEY_ARN</span> \
  --ssh-ec2-key-pair gke-key \
  --tags google:gkemulticloud:cluster=aws-cluster
</code></pre>
<p>If you don't see a green tick next to your cluster on the GKE clusters page, scroll all the way to the right of your new AWS cluster and click the three dots (Google calls this an <em>actions</em> menu). Select <strong>Log in</strong> and use the option to authenticate with your Google credentials. This authenticates your console session and creates the bridge from GKE to the Kubernetes API and AWS API. After a few moments, you should see a green tick which means your AWS cluster is now being managed by GKE!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725447824205/ed73cbcc-3b9f-4d28-98e8-1efc0582f063.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-testing-the-aws-cluster">Testing the AWS cluster</h3>
<p>Interacting with your new AWS cluster works the same way as if you were working with any other Kubernetes cluster. Even though we've built this cluster in AWS, we're managing it via GKE Enterprise, so we can use <code>gcloud</code> to get credentials for the Kubernetes API and store them in our local <code>kubeconfig</code> file:</p>
<pre><code class="lang-bash">gcloud container aws clusters get-credentials aws-cluster --location us-east4
</code></pre>
<p>Once you’ve done this, you can issue normal <code>kubectl</code> commands, for example, to see the state of the cluster’s nodes: <code>kubectl get nodes</code>.</p>
<p>Let’s run the following commands to run a test deployment and expose it via a Load Balancer, just like we did with our Google GKE cluster earlier:</p>
<pre><code class="lang-bash">kubectl create deployment hello-server --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
kubectl expose deployment hello-server --<span class="hljs-built_in">type</span> LoadBalancer --port 80 --target-port 8080
</code></pre>
<p>It will take a few minutes for the Load Balancer to get set up. Unlike exposing a Load Balancer service in Google Cloud, our AWS Kubernetes cluster will assign the service a hostname rather than an external IP. You can get the hostname with this command:</p>
<pre><code class="lang-bash">kubectl get service hello-server
</code></pre>
<p>You should now be able to load the hostname in your browser and see that your deployment is working on AWS! If it doesn’t work straight away, wait a few minutes while AWS configures the Load Balancer and try again.</p>
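<p>If you'd rather grab the hostname directly instead of reading it from the table output, a jsonpath query works too. This assumes you kept the <code>hello-server</code> name from the commands above:</p>
<pre><code class="lang-bash">LB_HOSTNAME=$(kubectl get service hello-server \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
curl http://$LB_HOSTNAME
</code></pre>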
<h3 id="heading-deleting-the-aws-cluster">Deleting the AWS cluster</h3>
<p>While deleting a cluster hosted in Google Cloud is quite straightforward, deleting an AWS cluster requires separate steps for the node pools and control plane. Assuming your names and locations match the instructions we followed previously to create the cluster, first delete the node pool with this command:</p>
<pre><code class="lang-bash">gcloud container aws node-pools delete pool-0 --cluster aws-cluster --location us-east4
</code></pre>
<p>When that command has completed, we can finally delete the control plane with this command:</p>
<pre><code class="lang-bash">gcloud container aws clusters delete aws-cluster --location us-east4
</code></pre>
<p>This will delete the EC2 virtual machines that were used to provide the cluster. However...</p>
<h3 id="heading-deleting-other-aws-resources">Deleting other AWS resources</h3>
<p>Even though you’ve deleted the EC2 virtual machines, your AWS account will still contain the other resources we created for the VPC and IAM configurations, some of which may continue to incur fees (such as the elastic IP assignments). There’s not enough room in this post to detail how to delete everything, but if you want to be sure that you don’t incur any further charges, walk back through the commands we used to create the resources and remove each resource individually, either using the <code>aws</code> command line or the web console.</p>
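<p>To give you a rough idea of what that walk-back looks like, here's a sketch of the sort of commands involved. It's not a complete or carefully ordered script: it assumes the environment variables from earlier are still set, it only shows the first of each set of three NAT gateways, subnets and route tables, and some deletions (NAT gateways in particular) take a while to finish before their dependencies can be removed.</p>
<pre><code class="lang-bash"># NAT gateways and their elastic IPs (repeat for 2 and 3)
aws ec2 delete-nat-gateway --nat-gateway-id $NAT_GW_ID_1
aws ec2 release-address --allocation-id $NAT_EIP_ALLOCATION_ID_1

# Internet gateway (detach from the VPC before deleting)
aws ec2 detach-internet-gateway --internet-gateway-id $INTERNET_GW_ID --vpc-id $VPC_ID
aws ec2 delete-internet-gateway --internet-gateway-id $INTERNET_GW_ID

# Subnets and route tables (again, repeat for 2 and 3), then the VPC itself
aws ec2 delete-subnet --subnet-id $PUBLIC_SUBNET_ID_1
aws ec2 delete-subnet --subnet-id $PRIVATE_SUBNET_ID_1
aws ec2 delete-route-table --route-table-id $PUBLIC_ROUTE_TABLE_ID_1
aws ec2 delete-route-table --route-table-id $PRIVATE_ROUTE_TABLE_ID_1
aws ec2 delete-vpc --vpc-id $VPC_ID

# IAM instance profiles, roles and policies (repeat for the control plane and API roles)
aws iam remove-role-from-instance-profile \
  --instance-profile-name $NODE_POOL_PROFILE --role-name node-pool-role
aws iam delete-instance-profile --instance-profile-name $NODE_POOL_PROFILE
aws iam detach-role-policy --role-name node-pool-role --policy-arn $NODE_POOL_POLICY_ARN
aws iam delete-policy --policy-arn $NODE_POOL_POLICY_ARN
aws iam delete-role --role-name node-pool-role

# Finally, schedule the KMS key for deletion (7 days is the minimum waiting period)
aws kms schedule-key-deletion --key-id $KMS_KEY_ARN --pending-window-in-days 7
</code></pre>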
<h2 id="heading-summary">Summary</h2>
<p>We’ve demonstrated how to build a Kubernetes cluster in AWS that can be managed by GKE, and hopefully you now have a good understanding of the requirements for using GKE in AWS. Google Cloud and AWS are often used together by organisations to provide complementary services, extra redundancy, or availability in different geographic regions, so understanding how to leverage GKE across both vendors should prove very useful.</p>
<p>We're hopefully starting to gain an understanding of the purpose of GKE Enterprise, and what makes it different from working with standard GKE clusters. We learned how the Connect API allows GKE to communicate with other clouds and external Kubernetes clusters, and we started to look at the requirements for building GKE in environments that are external to Google Cloud. GKE guarantees that we are using conformant versions of Kubernetes, so under the hood we always have the same technologies and components to build from, even if a different vendor’s approach to cloud networking requires us to make a few changes. And this is the beauty of using open standards and an open-source project like Kubernetes.</p>
<p>Congratulations if you made it to the end of this very, very long post! 🎉 (Thanks, designers of AWS networking concepts). In the next post, we'll look at how to build GKE Enterprise on bare metal and VMware environments.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing GKE Enterprise]]></title><description><![CDATA[This is the first in a new series of long-form posts exploring the features of GKE Enterprise, formerly known as Anthos. GKE Enterprise is an additional subscription service for GKE that adds configuration and policy management, service mesh and othe...]]></description><link>https://timberry.dev/introducing-gke-enterprise</link><guid isPermaLink="true">https://timberry.dev/introducing-gke-enterprise</guid><category><![CDATA[gke]]></category><category><![CDATA[gke-enterprise]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Tue, 27 Aug 2024 09:44:17 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729603253428/ae1905b8-a933-4efb-9a90-e94ca13911c9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the first in a new series of long-form posts exploring the features of GKE Enterprise, formerly known as Anthos.</em> <a target="_blank" href="https://cloud.google.com/kubernetes-engine/enterprise/docs/concepts/overview"><em>GKE Enterprise</em></a> <em>is an additional subscription service for GKE that adds configuration and policy management, service mesh and other features to support running Kubernetes workloads in Google Cloud, on other clouds and even on-premises. Over the coming months, I'll try to explore all of these features in an easy-to-digest way.</em></p>
<h2 id="heading-introduction">Introduction</h2>
<p>Kubernetes is one of the most successful open-source software stories in the history of computing. Its rapid development and adoption can be attributed to the fact that it solves numerous problems that have faced developers and systems administrators for many years. If you have worked in technology for over 10 years, you may recall previous attempts to solve the fundamental problems of deploying, scaling, and managing software, which took various forms from simple scripting to package and configuration management systems, each with its own quirks and compromises. As these attempted solutions were developed, the problems themselves got more complex, as demand for software and services increased by orders of magnitude, complicating the way that we build and deploy software over distributed systems, often in other people’s datacenters.</p>
<p>As a way to package software, container technology was quickly adopted thanks to the developer-friendly toolset developed by Docker, but a logical and consistent way to orchestrate containers was still missing until Kubernetes. Based on Google’s extensive experience running software at an enormous scale over thousands of commodity servers, Kubernetes finally gave us a way to logically deploy, scale, and manage containerized software.</p>
<p>But progress never stops, and with it, complexity always increases. We’ve moved past the early days of mass cloud adoption and we live in a world where complex distributed systems need to run reliably on any combination of cloud or on-premises platforms. These new problems have mandated new solutions and new technologies such as service mesh and a distributed approach to authorization and authentication. Fully managed Kubernetes services evolved to support these new challenges with the launch of Google Cloud’s Anthos, which has now matured into GKE Enterprise.</p>
<p>Of course, not all deployments need to be this complex, and if you’re lucky enough to be working on small-scale projects with moderately sized clusters, you may not need all of these extra features. You certainly should not try to apply over-engineered solutions to every problem, just for the sake of playing with the latest cool new technologies. One of the approaches I want to take in this series is to identify the use case for every new tool, so you can appreciate where – and where not – to use it.</p>
<p>This brings us to our overall objective. Despite the increased complexity I’ve described, thankfully these new technologies are easy enough to integrate into your projects because they operate within Kubernetes’ fundamentally logical model: a clear set of APIs and a declarative approach to using them.</p>
<p>In this series, we’ll explore each advanced feature in a sensible order, explaining what it's for and how you can use it in an approachable, no-nonsense way. We’ll start by revisiting some core Kubernetes topics to set the scene for what we’re going to learn, but we’re not going to teach Kubernetes from scratch. If you don’t already have Kubernetes experience under your belt, you might want to get some more practice in before exploring GKE Enterprise!</p>
<h2 id="heading-a-quick-recap-kubernetes-architecture">A quick recap: Kubernetes Architecture</h2>
<p>Like I said, you should already be reasonably familiar with Kubernetes, so we’re not going to deep dive into its architectural nuts and bolts here. However, the topic is worth revisiting to explain the limits of the architecture that the core Kubernetes project cares about, where those sit within managed environments, and how the advanced features of GKE Enterprise extend them to other architectures.</p>
<p>To recap, Kubernetes generally runs on clusters of computers. You can of course run all of its components on a single computer if you wish, but this would not be considered a production environment.</p>
<p>Computers in a Kubernetes cluster are referred to as <em>nodes</em>. These can be provided by physical (bare metal) machines or virtual machines. Nodes provide the raw computing power and the container runtimes to actually run our containerized workloads. Nodes that only run our workloads are often referred to as worker nodes, and there are mechanisms that exist to group worker nodes of the same configuration together. However, we also need some computers to run the Kubernetes control plane.</p>
<p>The <em>control plane</em> can be thought of as Kubernetes’ brain to a certain extent. The control plane’s components make decisions about the cluster, store its state, and detect and respond to events. For example, the control plane will receive instructions describing a new workload and decide where the workload should be run. In production environments, it's common to isolate the control plane to its own set of computers, or even run multiple copies of a control plane for high availability.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723641751375/85d84442-7fe6-4e75-8301-6ef60c59c197.png" alt class="image--center mx-auto" /></p>
<p>As a reminder, the control plane runs these components:</p>
<ul>
<li><p><strong>kube-apiserver</strong>: The front-end of the control plane, exposing the Kubernetes API.</p>
</li>
<li><p><strong>etcd</strong>: The consistent key-value store for all cluster data such as configuration and state.</p>
</li>
<li><p><strong>kube-scheduler</strong>: A watch loop that keeps an eye out for workloads without an assigned node, so that it can assign one to them. Influencing the scheduler is an important but complex topic.</p>
</li>
<li><p><strong>kube-controller-manager</strong>: The component that runs controller processes. These are additional watch loops for tasks like monitoring nodes and managing workloads.</p>
</li>
<li><p><strong>cloud-controller-manager</strong>: If you’re running your cluster in the cloud, this component links Kubernetes to your cloud provider’s APIs.</p>
</li>
</ul>
<p>Each worker node will run these components:</p>
<ul>
<li><p><strong>kubelet</strong>: Kubernetes’ agent on each node and the point of contact for the API server. The kubelet agent also manages the container runtime to ensure the workloads it knows about are actually running.</p>
</li>
<li><p><strong>kube-proxy</strong>: A network proxy to implement Kubernetes service concepts (more on those in a moment).</p>
</li>
<li><p><strong>Container runtime</strong>: The underlying container runtime responsible for actually running containers (such as containerd, CRI-O, or other supported runtime).</p>
</li>
</ul>
<p>Remember that Kubernetes is <em>software that runs on your nodes</em>. Kubernetes itself does not manage the actual nodes, whether they are physical or virtual. If you build Kubernetes for yourself, you will need to provision those computers and operating systems, and then install the Kubernetes software, although the Kubernetes project does provide tools like kubeadm to help you do this. When Kubernetes is running, it can only communicate with its other components and can’t directly manage a node, although the node controller component does at least notice if a node goes down.</p>
<p>It’s for this reason that managed services for Kubernetes have gained so much popularity. Services such as <strong>Google Kubernetes Engine</strong> (<strong>GKE</strong>) provide management of the <em>infrastructure</em> required to run Kubernetes as well as the software. For example, GKE can deploy nodes, group them together, manage and update the Kubernetes software on them, auto-scale them and even heal them if they become unhealthy. But this is an important distinction – GKE is managing your infrastructure, while Kubernetes is running on that infrastructure and managing your application workloads.</p>
<p>Within standard Kubernetes and most managed services, your Kubernetes cluster is the perimeter of your service, although of course you can have many clusters, all separately managed. As we’ll learn soon, GKE Enterprise allows us to extend those boundaries and manage multiple clusters in any number of different platforms, all with a single management view and development experience. But before we get into those topics, let’s refresh our memory on some more basic Kubernetes principles. If you’re a Kubernetes expert and you’re just here for the GKE Enterprise specifics, feel free to skip this section!</p>
<h2 id="heading-fundamental-kubernetes-objects">Fundamental Kubernetes Objects</h2>
<p>Let's revisit some fundamental Kubernetes objects, because we need a solid understanding of these before we learn how to extend them with GKE Enterprise’s advanced features in future posts. If you live and breathe Kubernetes on a daily basis, feel free to skip this section! Alternatively, if you're not comfortable with any of them, a good place to start is <a target="_blank" href="https://kubernetes.io/docs/concepts/overview/working-with-objects/">https://kubernetes.io/docs/concepts/overview/working-with-objects/</a></p>
<h3 id="heading-pods">Pods</h3>
<p><strong>Pods</strong> are the smallest deployable object on Kubernetes, representing an application-centric <em>logical host</em> for one or more containers. Arguably, the patterns they enable for colocating sidecar, init, and other supporting containers provided the flexibility that pushed Kubernetes ahead of other orchestration platforms. The key design principle of Pods is that they are the atomic unit of scale. In general, you increase the number of Pods (not the number of containers inside a Pod) to increase capacity. Aside from some specific circumstances, it’s very rare to deploy an individual Pod on its own. Logical controllers such as Deployments and StatefulSets are much more useful.</p>
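<p>As a quick refresher, a minimal Pod manifest looks something like this, reusing the sample image from earlier (you'd rarely apply one directly like this outside of testing):</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-pod
  labels:
    app: hello
spec:
  containers:
  - name: hello
    image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
    ports:
    - containerPort: 8080
EOF
</code></pre>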
<h3 id="heading-replicasets">ReplicaSets</h3>
<p><strong>ReplicaSets</strong> are sets of identical Pods, where each Pod in the set shares exactly the same configuration. ReplicaSets pre-date other controller objects, and even though we never explicitly create a ReplicaSet on its own, it is worth understanding their purpose within those other logical objects. You can think of a ReplicaSet as a version of a configuration. If you change a Deployment for example, a new ReplicaSet is created to represent that new version. Because ReplicaSets are just versions of configuration, we can keep a history of them in the Kubernetes data store, and easily roll back versions.</p>
<h3 id="heading-deployments">Deployments</h3>
<p><strong>Deployments</strong> are one of the most commonly used controller objects in Kubernetes. Controller objects are useful because the control plane runs a watch loop to ensure that the things you are asking for in your object actually exist. This is the fundamental concept of declarative configuration in Kubernetes. Individual Pods, for example, may come and go, and to a certain extent, the control plane doesn’t care. But if your Deployment object states that you should have 5 Pods of a specific configuration, the Deployment controller will make sure that this remains true.</p>
<p>Deployments are designed specifically for <strong>stateless workloads</strong>. This means that the container itself must not attempt to store or change any state locally and that each Pod in the Deployment must be an anonymous identical replica. Of course, you can still affect the state as long as it lives somewhere else, such as in a networked storage service or database. Stateless deployments offer you the most flexibility because your Pods can be scaled – effectively, deleted, and recreated – at will.</p>
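<p>Here's a minimal sketch of a Deployment declaring five replicas of that same sample image; whatever happens to individual Pods, the Deployment controller will keep five matching Pods running:</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
EOF
</code></pre>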
<h3 id="heading-statefulsets">StatefulSets</h3>
<p>For workloads that are required to manage state, a <strong>StatefulSet</strong> provides most of the logic of the Deployment object but also introduces some guarantees. Pods are scaled up and down in a logical order, do not have to be identical, and are guaranteed a unique identity. A common design pattern for stateful deployments is to have the first Pod serve as a controller and additional Pods act as workers. Each Pod can determine its own purpose based on its identity and local hostname. Additionally, storage volumes can be attached to Pods on a one-to-one basis.</p>
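<p>A bare-bones StatefulSet sketch looks like this. The names are made up for illustration, and it assumes a headless Service called <code>hello-stateful</code> exists; each Pod gets a stable identity (<code>hello-stateful-0</code>, <code>hello-stateful-1</code>, and so on) and its own PersistentVolumeClaim:</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hello-stateful
spec:
  serviceName: hello-stateful    # the headless Service that governs the Pods' DNS identities
  replicas: 3
  selector:
    matchLabels:
      app: hello-stateful
  template:
    metadata:
      labels:
        app: hello-stateful
    spec:
      containers:
      - name: hello
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
  volumeClaimTemplates:          # one PersistentVolumeClaim per Pod, on a one-to-one basis
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF
</code></pre>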
<h3 id="heading-services">Services</h3>
<p>With Pods coming and going all the time, and workloads being served by multiple Pods at any given time, we need a way to provide a fixed network endpoint. This is the purpose of the Service object. Services come in a few different forms, but all of them provide a fixed internal IP available to the cluster that can route traffic in a round-robin fashion to a group of Pods. The group of Pods is determined by a label selector. For example, if your Service is configured to route traffic to Pods that match the labels <code>app=web</code> and <code>env=prod</code>, all Pods that contain these labels in their metadata will be included in the round-robin group. It’s a very low-level object that we rarely use on its own, but it’s important to understand.</p>
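<p>To make that concrete, here's what a Service using exactly those labels might look like; any Pods carrying both <code>app=web</code> and <code>env=prod</code> become endpoints behind its fixed cluster IP:</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: v1
kind: Service
metadata:
  name: web-prod
spec:
  selector:
    app: web
    env: prod
  ports:
  - port: 80          # the fixed port exposed inside the cluster
    targetPort: 8080  # the port the selected Pods listen on
EOF
</code></pre>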
<h3 id="heading-ingress">Ingress</h3>
<p>Service objects have been used historically to provide external access to workloads running in a Kubernetes cluster, often through the <code>LoadBalancer</code> type of Service, which attaches an external IP address or provisions an external load balancer, depending on where you’re running your cluster. However, there are limitations to using a low-level object that provides such a direct mapping of traffic, so the Ingress object was created.</p>
<p>The <strong>Ingress</strong> object is designed for HTTP and HTTPS traffic and offers more features and flexibility than the simple Service, such as the ability to route traffic to different back-end services based on request paths. Ingress objects themselves are used in conjunction with an Ingress Controller, which is typically a Deployment of some kind of proxy software, such as the NGINX web server. Depending on which server you use as your Ingress Controller, you can also add annotations to your Ingress objects, for example, to invoke NGINX rewrite rules.</p>
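<p>As a rough illustration of that path-based routing, here's an Ingress that sends <code>/shop</code> and <code>/blog</code> traffic to two hypothetical back-end Services, with an NGINX rewrite annotation thrown in (the annotation only means something if your Ingress Controller is ingress-nginx):</p>
<pre><code class="lang-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - http:
      paths:
      - path: /shop
        pathType: Prefix
        backend:
          service:
            name: shop-service     # hypothetical back-end Services
            port:
              number: 80
      - path: /blog
        pathType: Prefix
        backend:
          service:
            name: blog-service
            port:
              number: 80
EOF
</code></pre>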
<p>The Ingress pattern has been extremely useful and popular, but due to its limited focus on HTTP the feature has now been frozen, meaning it will not receive any further development. Instead, it has inspired a new approach that takes the same flexible design ideas from Ingress but handles all kinds of traffic.</p>
<h3 id="heading-gateway-api">Gateway API</h3>
<p>The Gateway API is the project that enables this approach. Notably, this is an add-on for Kubernetes, not part of its default supported API groups. Gateway API provides three new custom API resources (there’s a short sketch of how they fit together after this list):</p>
<ul>
<li><p><strong>Gateway</strong>: An object representing how an external request should be translated within the Kubernetes cluster. A Gateway resource specifies protocols and ports and can also offer granular control over allowed routes and namespaces.</p>
</li>
<li><p><strong>HTTPRoute</strong>: An object representing how traffic coming through a gateway should be directed to a back-end service, including paths and rules to match. Other protocol-specific objects (such as GRPCRoute) are also available.</p>
</li>
<li><p><strong>GatewayClass</strong>: An object representing the controller that provides the actual functionality for gateways, similar to the Ingress Controller we described earlier. Gateways must reference a GatewayClass.</p>
</li>
</ul>
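<p>As a rough sketch of how these three resources reference each other (again using the Kubernetes Python client and its generic custom-objects API, and assuming the Gateway API CRDs are installed and a GatewayClass called <code>example-gc</code> exists; all of the names and paths here are placeholders), a Gateway and an HTTPRoute might look like this:</p>
<pre><code class="lang-python">from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
GROUP, VERSION = "gateway.networking.k8s.io", "v1"

# A Gateway listening for plain HTTP on port 80, backed by the hypothetical example-gc GatewayClass
gateway = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "Gateway",
    "metadata": {"name": "web-gateway"},
    "spec": {
        "gatewayClassName": "example-gc",
        "listeners": [{"name": "http", "protocol": "HTTP", "port": 80}],
    },
}

# An HTTPRoute that attaches to the Gateway and sends /app traffic to a back-end Service
route = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "HTTPRoute",
    "metadata": {"name": "web-route"},
    "spec": {
        "parentRefs": [{"name": "web-gateway"}],
        "rules": [{
            "matches": [{"path": {"type": "PathPrefix", "value": "/app"}}],
            "backendRefs": [{"name": "web", "port": 80}],
        }],
    },
}

api.create_namespaced_custom_object(GROUP, VERSION, "default", "gateways", gateway)
api.create_namespaced_custom_object(GROUP, VERSION, "default", "httproutes", route)
</code></pre>
<p>The important part is the chain of references: the Gateway points at a GatewayClass, and the HTTPRoute points back at the Gateway via <code>parentRefs</code> while directing matched paths to a back-end Service.</p>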
<p>So we've refreshed our memory on some of the most commonly used Kubernetes objects. Of course, there are many more that we haven’t touched on, but it was important to go back to basics and build up to the Gateway API, as it’s a key component of many GKE Enterprise features such as service mesh. Next, to understand where GKE Enterprise fits in with the portfolio of Google Cloud services, let’s discuss the different editions of GKE available.</p>
<h2 id="heading-anthos-multi-cloud-and-gke-editions">Anthos, Multi-cloud and GKE Editions</h2>
<p><strong>Google Kubernetes Engine</strong> (<strong>GKE</strong>) has evolved through many iterations over the years. As we previously mentioned, Google’s own internal container orchestration work was what originally inspired the open-source project Kubernetes, which was released to the world in June 2014. Google’s cloud platform at the time had a handful of core services, and by 2015 something called <strong>Google Container Engine</strong> had launched as a generally available service.</p>
<p>This new service promised to take all of the convenience and power of the Kubernetes project and wrap it in a fully managed service for the infrastructure required to run it, for example by auto-scaling and auto-healing the virtual machines that make up a cluster. It soon had a fantastic success story – the global phenomenon of Pokemon Go launched in 2016 with all of its backend services running on Google Container Engine (with lots of help from Google engineers). In 2017, the service was renamed more accurately as Google Kubernetes Engine, and it was later certified by the Cloud Native Computing Foundation as a conformant Kubernetes platform, guaranteeing its compatibility with open-source Kubernetes implementations.</p>
<p>Over the following years, Google continued to improve GKE by offering the latest versions of Kubernetes as part of their managed service. During this time Google Cloud moved to position itself as a vendor that supported hybrid and multi-cloud portability, and began to develop extensions to GKE that allowed clusters to run on infrastructure outside of Google’s own cloud. This new service launched under the product name of <strong>Anthos</strong> in 2019.</p>
<p>Anthos, as it turns out, was more of a name for a subscription service that you had to pay for rather than an individual product. Paying for Anthos gave you access to new features including <strong>GKE On-Prem</strong> and <strong>Anthos Migrate</strong>. GKE On-Prem allowed you to run Kubernetes clusters on bare-metal and VMWare servers which were managed by GKE. Anthos Migrate would (in theory) allow you to migrate virtual machines to stateful Pods on a GKE cluster. Anthos was rough around the edges on launch, but soon added even more useful features such as a managed Istio service mesh, a managed KNative implementation (known as <strong>Cloud Run for Anthos</strong>), policy agents, and more.</p>
<p>Over the years more features were developed for Anthos, and occasionally they would drift into the core GKE service – and back out again. This occurred for the service mesh features as well as a new service for binary authorization of containers. Support for running clusters in AWS and Azure was added, causing some confusion with the “On-Prem” product name. Anthos rapidly grew into an extremely powerful platform for running GKE anywhere, but its original product name now seemed somewhat incongruous.</p>
<h3 id="heading-from-anthos-to-gke-enterprise">From Anthos to GKE Enterprise</h3>
<p>So towards the end of 2023, Google announced the launch of GKE Enterprise, which essentially means two things. Firstly, it’s a normalization of everything Anthos was but under the proper GKE product banner. Secondly, it’s a quiet deprecation of the Anthos brand. To all intents and purposes, particularly when it comes to features, technology, or documentation, you can consider the two terms interchangeable. The main difference is that Anthos is no longer a separate thing adjacent to GKE, it is an <em>edition of GKE</em>. GKE Enterprise to be exact!</p>
<p>Now, when choosing how to use GKE as a platform, you have two options: <strong>GKE Standard</strong> and <strong>GKE Enterprise</strong>. At a very high level, GKE Standard provides an advanced managed Kubernetes offering within Google Cloud, while GKE Enterprise extends the offering across multiple clouds and bare metal and adds additional features.</p>
<p>The most important features are presented for comparison in the table below:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>GKE Standard</td><td>GKE Enterprise</td></tr>
</thead>
<tbody>
<tr>
<td>Fully automated cluster lifecycle management</td><td>Fully automated cluster lifecycle management</td></tr>
<tr>
<td>Certified open-source compliant Kubernetes</td><td>Certified open-source compliant Kubernetes</td></tr>
<tr>
<td>Workload and cluster auto-scaling</td><td>Workload and cluster auto-scaling</td></tr>
<tr>
<td></td><td>GitOps-based configuration management</td></tr>
<tr>
<td></td><td>Managed service mesh</td></tr>
<tr>
<td></td><td>Managed policy controller</td></tr>
<tr>
<td></td><td>Fleet management</td></tr>
<tr>
<td></td><td>GKE on AWS, Azure, VMWare and Bare Metal</td></tr>
<tr>
<td></td><td>Multi-cluster ingress</td></tr>
<tr>
<td></td><td>Binary authorization</td></tr>
</tbody>
</table>
</div><p>Pricing changes all the time in the competitive world of cloud computing, but at the time of writing GKE Standard charges a fee per cluster hour, while GKE Enterprise charges a fee per CPU hour.</p>
<h3 id="heading-what-about-autopilot">What about Autopilot?</h3>
<p>A separate development in the world of GKE was the launch of <strong>Autopilot</strong> in 2021, an attempt to bring some of the convenience of the serverless world to Kubernetes. When creating a normal GKE cluster you must make decisions about the type, size, and number of your nodes. You may or may not enable node-autoscaling, but you normally have to continue to manage the overall capacity of your cluster during its lifetime.</p>
<p>With Autopilot, a cluster is fully managed for you behind the scenes. You no longer need to worry about nodes, as they are automatically scaled to the correct size for you. All you have to do is supply your workload configurations (for example, Deployments), and Autopilot will make all the necessary decisions to run them for you.</p>
<p>Confusingly, this means that GKE Standard is now an edition of GKE, as well as a <em>type of cluster</em>. A GKE Standard cluster is a cluster where you manage the nodes as well as the workloads, and these clusters can run in either edition of GKE, Standard, or Enterprise. Autopilot clusters are also supported in both the Standard and Enterprise editions of GKE. However, at the time of writing, Autopilot clusters are only supported inside Google Cloud.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723644002021/953b9a31-b554-4bbf-8b9b-0194b5c70b32.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-summary">Summary</h2>
<p>In this first post, we have refreshed our memories on the basic architecture of a Kubernetes cluster and some of the fundamental component objects that are used to build workloads and services. Having these concepts fresh in our memory will be useful as we introduce the advanced features of GKE Enterprise throughout the rest of the series.</p>
<p>We have also clarified the different versions and editions of GKE, and where Anthos fits into the equation. Hopefully, this all makes sense now, despite the best efforts of Google’s product managers! Importantly, we’ve established what makes GKE Enterprise different, and now we’re ready to start learning how to use it. In the next post, we’ll begin by enabling GKE Enterprise and configuring our first GKE clusters.</p>
]]></content:encoded></item><item><title><![CDATA[Adding an SQLite backend to FastAPI]]></title><description><![CDATA[While recently migrating my blog (again), I've revisited some posts including my tutorial: A simple Python FastAPI template with API key authentication.
That tutorial set out a very basic template for a FastAPI app that used API keys, but to keep it ...]]></description><link>https://timberry.dev/adding-an-sqlite-backend-to-fastapi</link><guid isPermaLink="true">https://timberry.dev/adding-an-sqlite-backend-to-fastapi</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[Python]]></category><category><![CDATA[SQLite]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Fri, 16 Aug 2024 14:36:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723817919840/f5d32a5a-809a-427b-89af-8cfc7c11a703.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While recently migrating my blog (again), I've revisited some posts including my tutorial: <a target="_blank" href="https://timberry.dev/fastapi-with-apikeys">A simple Python FastAPI template with API key authentication</a>.</p>
<p>That tutorial set out a very basic template for a FastAPI app that used API keys, but to keep it simple it used hard-coded database functions that simply checked for API keys in a local Dictionary. My plan was always to build on that first step, however, and slowly improve the template. In this post, we'll actually implement a database lookup function using <a target="_blank" href="https://www.sqlite.org/index.html">SQLite3</a>. If you want to follow along, just grab the code from the previous post.</p>
<p>The first thing we'll do is actually create a database. SQLite3 is rarely used in production systems (although sometimes it is!) but hopefully you can see from this example how easy it would be to start building something similar using, say, Postgres. Within the <code>app</code> directory (remember we're using code from the previous post), create a new database file with this command:</p>
<pre><code class="lang-bash">sqlite3 db.sqlite
</code></pre>
<p>We can now populate the database. We'll keep it simple by creating a <code>users</code> table to store the API keys for Bob and Alice, our test users from before:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span>(userid <span class="hljs-built_in">text</span>, <span class="hljs-keyword">name</span> <span class="hljs-built_in">text</span>, apikey <span class="hljs-built_in">text</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">VALUES</span>(<span class="hljs-string">'7oDYjo3d9r58EJKYi5x4E8'</span>,<span class="hljs-string">'Bob'</span>,<span class="hljs-string">'e54d4431-5dab-474e-b71a-0db1fcb9e659'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">VALUES</span>(<span class="hljs-string">'mUP7PpTHmFAkxcQLWKMY8t'</span>,<span class="hljs-string">'Alice'</span>,<span class="hljs-string">'5f0c7127-3be9-4488-b801-c7b6415b45e9'</span>);
</code></pre>
<h3 id="heading-dbpy">db.py</h3>
<p>If you recall from the previous post, our <code>db.py</code> file contained hard-coded user IDs and API keys, and then two functions. <code>check_api_key</code> would check for the existence of the API key in the "database" (which was really just a Dictionary), and <code>get_user_from_api_key</code> would then retrieve the user object itself. I used this two-step approach to demonstrate the logic of things, but seeing as we're upgrading to using a real database, let's also make this more efficient.</p>
<p>We'll replace <code>db.py</code> entirely with the following:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> sqlite3

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_api_key</span>(<span class="hljs-params">api_key: str</span>):</span>
    <span class="hljs-keyword">with</span> sqlite3.connect(<span class="hljs-string">'db.sqlite'</span>) <span class="hljs-keyword">as</span> conn:
        cur = conn.cursor()
        cur.execute(<span class="hljs-string">'select name from users where apikey = ?'</span>, [api_key])
        row = cur.fetchone()
        <span class="hljs-keyword">if</span> row:
            <span class="hljs-keyword">return</span>(row)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>
</code></pre>
<p>Now we just have a <code>check_api_key</code> function. It makes the connection to the database, and searches for a row based on the provided API key. If an API key matches, it returns that row (in other words, the user object).</p>
<p>If no API key matches, we return <code>False</code>. Now we need to update how we call this function from <code>auth.py</code>.</p>
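<p>If you want to sanity-check the function before wiring it into FastAPI, you can call it directly from a Python shell. Run it from the <code>app</code> directory, because the connection uses the relative <code>db.sqlite</code> path:</p>
<pre><code class="lang-python">from db import check_api_key

# Bob's key from the INSERT statements above
print(check_api_key("e54d4431-5dab-474e-b71a-0db1fcb9e659"))  # ('Bob',)
print(check_api_key("not-a-real-key"))                        # False
</code></pre>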
<h3 id="heading-authpy">auth.py.</h3>
<p>We've changed the logic of how these two parts of the program talk to each other. Here's our updated auth.py:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Security, HTTPException, status
<span class="hljs-keyword">from</span> fastapi.security <span class="hljs-keyword">import</span> APIKeyHeader
<span class="hljs-keyword">from</span> db <span class="hljs-keyword">import</span> check_api_key

api_key_header = APIKeyHeader(name=<span class="hljs-string">"X-API-Key"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_user</span>(<span class="hljs-params">api_key_header: str = Security(<span class="hljs-params">api_key_header</span>)</span>):</span>
    user = check_api_key(api_key_header)
    <span class="hljs-keyword">if</span> user:
        <span class="hljs-keyword">return</span> user
    <span class="hljs-keyword">raise</span> HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail=<span class="hljs-string">"Missing or invalid API key"</span>
    )
</code></pre>
<p>Comparing this to the previous version, you can see that we now just define the <code>get_user</code> function. This is the only function that gets called by our secure route, and we've removed the need for the separate <code>get_user_from_api_key</code> function. (Note that we've also removed it from the <code>import</code> line).</p>
<p>All that <code>get_user</code> has to do now is call the <code>check_api_key</code> function from our database and check the response. As long as it's <em>not false,</em> it will return the user object. If the response <em>is</em> false however, we know that the API key was missing or invalid, so we return the correct HTTP code just like before.</p>
<p>What's next for our little FastAPI app? I was already thinking along these lines in the previous post, but if we've handled authentication, we should start thinking about <em>authorisation</em>. What are Bob and Alice permitted to do within our app, how do we store that in a database and apply the logic in our app?</p>
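<p>Just to sketch one possible direction (a rough idea, not a finished design): we could add a <code>role</code> column to the <code>users</code> table, change the query to <code>select name, role</code>, and then layer a small dependency on top of <code>get_user</code> that only lets certain roles through:</p>
<pre><code class="lang-python">from fastapi import Depends, HTTPException, status
from auth import get_user

# Hypothetical: assumes check_api_key now returns a (name, role) row because
# the users table has gained a role column.
def require_admin(user: tuple = Depends(get_user)):
    name, role = user
    if role != "admin":
        raise HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail="You are not allowed to do that"
        )
    return user
</code></pre>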
<p>Maybe next time I migrate my blog, I'll write the follow up :)</p>
]]></content:encoded></item><item><title><![CDATA[Certifications in Tech: Do they matter?]]></title><description><![CDATA[I’ve been acquiring certifications in tech for over 10 years, and more recently I’ve been training a lot of people to get their own certifications as well. I can recall being sat in an actual physical classroom (remember those?) with a handful of oth...]]></description><link>https://timberry.dev/certs-in-tech-do-they-matter</link><guid isPermaLink="true">https://timberry.dev/certs-in-tech-do-they-matter</guid><category><![CDATA[learning]]></category><category><![CDATA[Certification]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Thu, 23 May 2024 11:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723632679494/72383b57-230d-4ffa-a5ca-72f59645cb3f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I’ve been acquiring certifications in tech for over 10 years, and more recently I’ve been training a lot of people to get their own certifications as well. I can recall being sat in an actual physical classroom (remember those?) with a handful of other people furiously tapping away at PCs running RedHat Linux, hoping that they’d boot back up into the desired state when the adjudicator would switch them all off at the end of the exam. I’ve also sat in exam centres multiple times with That One Guy Who Won’t Stop Coughing. So why do we put ourselves through all this? Do certs really matter?</p>
<p>Clearly they matter to a lot of people, myself included. Passing a certification is a way of displaying that you have learned the skills to pass the exam at the very least, and depending on the exam this might be a very close approximation of the skills required to do a specific job. Certification badges are a kind of <strong>reputational capital</strong>. We acquire them to show that we have accumulated skills, and they demonstrate that we have taken an interest in our own continuing professional development. Often they are a requirement, or at least an easy filtering mechanism for recruiters.</p>
<p>But are there other forms of reputational capital? Absolutely there are! Let’s not forget, not everyone copes well under exam conditions - and more importantly, most exam conditions are completely detached from what is required to actually do a job (with some exceptions, more on that in a moment!). Open-source contributions are another great way to demonstrate your skills and interests, as are blog posts and other forms of content creation. Conversely, you may be a highly skilled software developer but your job prevents you from sharing this with the world, so certs might be your only choice.</p>
<p>So, if certs are right for you, where do you start? My background and experience are mostly in the clouds, so here’s a quick overview of where to get started with the most relevant cloud certifications (in my humble opinion). In alphabetical order:</p>
<p><strong>Amazon Web Services (AWS)</strong></p>
<p>The AWS entry level certification is <a target="_blank" href="https://aws.amazon.com/certification/certified-cloud-practitioner/">Certified Cloud Practitioner</a> (CCP). This exam proves you have a foundational knowledge of cloud concepts, as well as AWS services and technologies. In theory, you don’t need to have technical skills to pass this exam, although you will need to memorise lots of different AWS services which tend to have obscure names! For technical certifications, AWS offers Associate level certifications, which you must have before you progress onto Professional level certifications. These are all offered in the roles of Solutions Architect, Data Engineer, Developer and SysOps Administrator. Most cloud generalists take the Solutions Architect route, starting with the <a target="_blank" href="https://aws.amazon.com/certification/certified-solutions-architect-associate/">Solutions Architect: Associate</a> exam. These are all multiple choice “closed book” exams.</p>
<p><strong>Azure</strong></p>
<p>Similar to AWS, Microsoft offers different levels of certifications, from Fundamentals to Associate, then Expert level with some Speciality exams too, and these are also generally based on roles such as Data Engineer and Administrator. If you don’t know where to specialise yet, you can start with <a target="_blank" href="https://learn.microsoft.com/en-us/credentials/certifications/azure-fundamentals/">Microsoft Certified: Azure Fundamentals</a>, before moving onto one of the associate level certs like <a target="_blank" href="https://learn.microsoft.com/en-us/credentials/certifications/azure-administrator/">Microsoft Certified: Azure Administrator Associate</a>. Depending on the level and topic of your exam, you may have a combination of multiple choice and other question types. Microsoft is also introducing some hands-on elements to their exams to test your practical skills. Make sure to check the exam guide in depth for your chosen certification!</p>
<p><strong>Google Cloud</strong></p>
<p>Almost all of Google Cloud’s certification exams are aimed at the Professional practitioner level, and based on various job roles including Cloud Architect, Data Engineer, Machine Learning Engineer and Developer. The exception to this rule is the <a target="_blank" href="https://cloud.google.com/learn/certification/cloud-engineer">Associate-level Cloud Engineer</a> exam (ACE) and the <a target="_blank" href="https://cloud.google.com/learn/certification/cloud-digital-leader">Cloud Digital Leader</a> exam. Either of these might be a good place to start for you. The Cloud Digital Leader certification is pitched as a non-technical exam for business leaders and other “tech-adjacent” stakeholders, however you will still need a good understanding of Google Cloud concepts and products, similar to how AWS sets up its CCP exam. If you’re looking to take your first technical step into Google Cloud certification, the ACE may be more appropriate for you, but it will require a good working knowledge of operating various Google Cloud solutions. Currently, all of these exams are multiple choice.</p>
<p>That’s enough vendor certs, what about cloud-native, cloud-adjacent, or just good old plain open source?</p>
<p>The Cloud Native Computing Foundation (CNCF) offers several exams in the realms of Kubernetes and other open source technologies. For example, the famous <a target="_blank" href="https://www.cncf.io/training/certification/cka/">Certified Kubernetes Administrator</a> (CKA) exam is a completely hands-on exam, testing your ability to operate, manage and fix Kubernetes clusters and their workloads (with not a single multiple-choice question in sight). This style of exam may seem more daunting than a simple question paper, but in reality it is a much better representation of what doing an actual job is like, which is why these credentials are held in such high regard. The CKA exam environment has you working with real problems on real clusters, with access to the Kubernetes project website and documentation. If you’re just getting started with Kubernetes, you may want to look at the <a target="_blank" href="https://www.cncf.io/training/certification/kcna/">Kubernetes and Cloud Native Associate</a> (KCNA) exam first, which is not hands-on; it’s a multiple choice exam aimed at a more foundational level.</p>
<p>You might have a variety of motivations for getting certified. Maybe it’s to increase your reputational capital, maybe it’s to brush up your CV, or maybe it’s just because you love learning, and each new cert is another milestone. If you’ve never thought of trying for a certification before, don’t be afraid to give one a try. There’s no harm in patting yourself on the back once in a while!</p>
]]></content:encoded></item><item><title><![CDATA[How do you get started with learning in tech?]]></title><description><![CDATA[Can you switch careers and start from scratch? Some thoughts…
Most of my time these days is spent teaching and developing teaching material for some advanced technical topics such as Kubernetes, serverless and machine learning, and most of the people...]]></description><link>https://timberry.dev/get-started-learning-in-tech</link><guid isPermaLink="true">https://timberry.dev/get-started-learning-in-tech</guid><category><![CDATA[learning]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Thu, 16 May 2024 11:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723632580381/84be85cd-9071-42aa-9963-c7ad12195508.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Can you switch careers and start from scratch? Some thoughts…</strong></p>
<p>Most of my time these days is spent teaching and developing teaching material for some advanced technical topics such as Kubernetes, serverless and machine learning, and most of the people I’m teaching have already been working in tech for a number of years. The sheer volume of things to learn about can be daunting for someone new coming into tech, especially if they’re attempting a career change or have no formal computer science education.</p>
<p>A career change is an enormous undertaking. It will require massive commitment, a great deal of time and a lot of faith. But I fundamentally believe anyone can do it, and start working in tech! So where do you start?</p>
<p>In recent years, the lack of skills in the market has prompted companies like Google, IBM and Meta to produce some very slick online courses that can teach you everything you need to know to get started in some specific tech areas. Often they will assume no prior knowledge and will guide you through everything you need to know at a foundational level, with guided lessons and activities. These include Google’s Cybersecurity Professional Certificate, Meta’s Frontend Developer Professional Certificate, and IBM’s Full Stack Software Developer and DevOps programmes. You can find links to these and more at Coursera’s <a target="_blank" href="https://www.coursera.org/career-academy/">Find Your New Career</a> page.</p>
<p>And don’t forget that you can “audit” any Coursera class <strong>for free</strong> if you don’t want to pay right now (or at all). That just means you won’t get the certification or the feedback, but if you’re just there to learn you’ll still have access to everything you need.</p>
<p>While these courses are great, they do require you to have at least an inkling of the career direction you want to take up front. Would you want to spend 6 months learning full stack development, only to discover a yearning to be a data analyst? These courses also take a very vocational approach, focusing on specific skills for a specific role in the industry.</p>
<p>For me, the joy of learning comes from a life-long personal interest in tech. How is it put together, and why does it work the way it does? Exploring some fundamentals of computer science might help you to decide if tech really is for you, and you might discover which aspects of it inspire your own joy along the way.</p>
<p>One of the best online courses for doing this is <a target="_blank" href="https://cs50.harvard.edu/college/2024/fall/">Harvard’s CS50</a>, taught by the incredible David J. Malan. This course takes you through foundational building blocks but also throws you in at the deep end with C, explores data structures and algorithms and even covers some basic web and cybersecurity concepts. The course has been taught to thousands of people around the world through Harvard’s <a target="_blank" href="https://cs50.harvard.edu/x/2024/">OpenCourseware</a> or <a target="_blank" href="https://cs50.edx.org/">Edx</a> (both have free options). It’s a great way to find out what excites you about tech, and if you want to continue your journey afterwards, it even offers specialisms in web development, Python, game development, and of course AI.</p>
<p>What was your path into tech? Did you switch from another career, and if so, how? I’d love to hear from you if you have other amazing educational resources to share.</p>
]]></content:encoded></item><item><title><![CDATA[A simple Python FastAPI template with API key authentication]]></title><description><![CDATA[I’ve been dusting off my Python recently and building some APIs with FastAPI. I had a pretty simple use case - a backend API that uses API keys for authentication. However, I couldn’t find a straightforward tutorial. The FastAPI docs are very good but...]]></description><link>https://timberry.dev/fastapi-with-apikeys</link><guid isPermaLink="true">https://timberry.dev/fastapi-with-apikeys</guid><category><![CDATA[Python]]></category><category><![CDATA[FastAPI]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Fri, 05 Jan 2024 12:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723632219071/b9c9476f-ab7c-4a96-8d91-f67b4957f2b3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I’ve been dusting off my Python recently and building some APIs with <a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI</a>. I had a pretty simple use case - a backend API that uses API keys for authentication. However, I couldn’t find a straightforward tutorial. The FastAPI docs are very good but a little too in-depth; for example they bundle the creation of OAuth bearer tokens into the same code as the backend itself. So I wanted to create a template for myself that simplifies the authentication, abstracts the database stuff, and also follows a reasonable best-practices approach to project layout, modules, packages and so on. So here it is! But be warned; like I said, I’m still rusty. If you spot anything obviously terrible in the code below, please let me know (nicely).</p>
<h2 id="heading-project-layout">Project layout</h2>
<p>We’ll start with a simple Python project layout for a FastAPI app.</p>
<pre><code class="lang-plaintext">.
├── app
│   ├── __init__.py
│   ├── auth.py
│   ├── db.py
│   ├── main.py
│   └── routers
│       ├── __init__.py
│       ├── public.py
│       └── secure.py
└── venv
</code></pre>
<p>Hopefully this makes sense if you’re fairly comfortable with Python. <code>app</code> is our Python package in this case, <code>main</code> is our main module, while <code>auth</code> and <code>db</code> are utility modules in the package. <code>routers</code> is a subpackage containing the <code>public</code> and <code>secure</code> submodules. There are empty <code>__init__.py</code> files aplenty to tell Python this is a proper package, and in my case I also have a <code>venv</code> directory for my virtual environment, although that isn’t required. Now let’s go into what each of these files actually does.</p>
<h3 id="heading-mainpy">main.py</h3>
<p>Here’s our main module:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI, Depends
<span class="hljs-keyword">from</span> routers <span class="hljs-keyword">import</span> secure, public
<span class="hljs-keyword">from</span> auth <span class="hljs-keyword">import</span> get_user

app = FastAPI()

app.include_router(
    public.router,
    prefix=<span class="hljs-string">"/api/v1/public"</span>
)
app.include_router(
    secure.router,
    prefix=<span class="hljs-string">"/api/v1/secure"</span>,
    dependencies=[Depends(get_user)]
)
</code></pre>
<p>We start by importing the modules we’ll need from FastAPI, namely itself and the <code>Depends</code> module. From our own <code>auth</code> module we import a function called <code>get_user</code> which we’ll explain in a moment. From our <code>routers</code> subpackage we import <code>secure</code> and <code>public</code>. And we create our app with <code>app = FastAPI()</code>.</p>
<p>So what are these routers? Well, in a simple FastAPI app we could just start defining paths and responses in our main program. Here’s a super simple example:</p>
<pre><code class="lang-python"><span class="hljs-meta">@app.get("/")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">root</span>():</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"message"</span>: <span class="hljs-string">"Hello World"</span>}
</code></pre>
<p>In 3 lines we can specify that we want to return “Hello World” for every <code>GET</code> request to <code>/</code> (asynchronously to boot!). But this doesn’t scale very well.</p>
<p>A router is a separate set of paths and functions grouped together. They’re useful for grouping features of a much larger API together into logical pieces. In this tutorial, we’ll create two groups of paths: <code>secure</code> for paths that require an API key, and <code>public</code> for those that don’t.</p>
<p>Let’s look at this piece again:</p>
<pre><code class="lang-python">app.include_router(
    public.router,
    prefix=<span class="hljs-string">"/api/v1/public"</span>
)
</code></pre>
<p>Here we call the <code>include_router</code> function on our <code>app</code>, and pass in 2 parameters. The first, <code>public.router</code>, is the <code>router</code> object from the <code>public</code> module. The second, <code>prefix</code>, defines where these paths sit in the context of our overall API app. As you can see, routers make it easy to change URL prefixes and even introduce API versioning.</p>
<p>The next one is a bit more complex:</p>
<pre><code class="lang-python">app.include_router(
    secure.router,
    prefix=<span class="hljs-string">"/api/v1/secure"</span>,
    dependencies=[Depends(get_user)]
)
</code></pre>
<p>Now we’re using the <code>router</code> object from the <code>secure</code> module, and we’ve added a dependency using FastAPI’s <code>Depends</code> function. The dependency is on the <code>get_user</code> function, that we’ve defined in the <code>auth</code> module. As you can probably guess, that means that all paths in this router <em>depend</em> on that function. So let’s take a look at it.</p>
<h3 id="heading-authpy">auth.py</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> Security, HTTPException, status
<span class="hljs-keyword">from</span> fastapi.security <span class="hljs-keyword">import</span> APIKeyHeader
<span class="hljs-keyword">from</span> db <span class="hljs-keyword">import</span> check_api_key, get_user_from_api_key

api_key_header = APIKeyHeader(name=<span class="hljs-string">"X-API-Key"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_user</span>(<span class="hljs-params">api_key_header: str = Security(<span class="hljs-params">api_key_header</span>)</span>):</span>
    <span class="hljs-keyword">if</span> check_api_key(api_key_header):
        user = get_user_from_api_key(api_key_header)
        <span class="hljs-keyword">return</span> user
    <span class="hljs-keyword">raise</span> HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail=<span class="hljs-string">"Missing or invalid API key"</span>
    )
</code></pre>
<p>First we import some more useful stuff from FastAPI, including methods for dealing with headers and HTTP exceptions. They’ll make more sense when we get to the part that actually uses them in a moment. And yes, I do appreciate that this is a bit circular, because next we’re importing functions from <code>db</code> which we haven’t seen yet. But before we get into that file, just know that <code>check_api_key</code> returns <code>True</code> if an API key is valid, and <code>get_user_from_api_key</code> returns a user object from a valid API key.</p>
<p>The main purpose of the auth module is to provide us with the <code>get_user</code> function, which as we saw in the previous file, our secure routes depend on. This function accepts one parameter - the value of <code>X-API-Key</code> that has been sent in the request. Using the <code>APIKeyHeader</code> and <code>Security</code> functions in FastAPI allows us to define the header name for our API key (and therefore populate this automatically in our OpenAPI documentation) and extract it from the header.</p>
<p>With the API key stored in <code>api_key_header</code>, we next call our <code>check_api_key</code> function (again, we’ll see how that works in a moment). If the response is <code>True</code>, we proceed to get some user data by calling <code>get_user_from_api_key</code> and then return it.</p>
<p>If <code>check_api_key</code> returned <code>False</code>, then the API key we checked wasn’t valid. So we use FastAPI’s <code>HTTPException</code> to return a suitable response, in this case a 401 error.</p>
<p>Now we know what has to happen for every path that <em>depends</em> on <code>get_user</code>. Before we look at those paths and routes, let’s quickly explore <code>db.py</code>, which is where we defined <code>check_api_key</code> and <code>get_user_from_api_key</code>.</p>
<h3 id="heading-dbpy">db.py</h3>
<p>This file is basically a placeholder for whatever backend database integration you want to use. Right now, it implements the methods we need, and contains a hard-coded set of API keys and users.</p>
<pre><code class="lang-python">api_keys = {
    <span class="hljs-string">"e54d4431-5dab-474e-b71a-0db1fcb9e659"</span>: <span class="hljs-string">"7oDYjo3d9r58EJKYi5x4E8"</span>,
    <span class="hljs-string">"5f0c7127-3be9-4488-b801-c7b6415b45e9"</span>: <span class="hljs-string">"mUP7PpTHmFAkxcQLWKMY8t"</span>
}

users = {
    <span class="hljs-string">"7oDYjo3d9r58EJKYi5x4E8"</span>: {
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"Bob"</span>
    },
    <span class="hljs-string">"mUP7PpTHmFAkxcQLWKMY8t"</span>: {
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"Alice"</span>
    },
}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_api_key</span>(<span class="hljs-params">api_key: str</span>):</span>
    <span class="hljs-keyword">return</span> api_key <span class="hljs-keyword">in</span> api_keys

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_user_from_api_key</span>(<span class="hljs-params">api_key: str</span>):</span>
    <span class="hljs-keyword">return</span> users[api_keys[api_key]]
</code></pre>
<p>I’m sure I don’t need to tell you - <strong>don’t do this in production!</strong> Instead, you should incorporate whatever database connections and query methods you need to acquire API keys and user details from your database (how you actually <em>create</em> those keys is another discussion). The actual data structures may be very different as well.</p>
<p>For now, we have a simple <code>dict</code> of key:value pairs for <code>api_keys</code> where the key is the actual API key, and the value is the user ID. I’m using long UUIDs for API keys, and short UUIDs for user IDs, but again this could be something very different. For the users we have another <code>dict</code> where the key is the user ID, and the value is another <code>dict</code> to hold all the important user data. Right now we’re just storing a name.</p>
<p><code>check_api_key</code> simply checks that the API key exists in the list of keys it knows about. Again, you’d probably need to change this for some sort of DB lookup in production.</p>
<p><code>get_user_from_api_key</code> then uses the API key to get a corresponding user ID, and returns the nested user <code>dict</code>.</p>
<p>So far we have all of our basic functionality in terms of API authentication. Now we just need some routes!</p>
<h3 id="heading-routerspublicpy">routers/public.py</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> APIRouter

router = APIRouter()

<span class="hljs-meta">@router.get("/")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_testroute</span>():</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"OK"</span>
</code></pre>
<p>This is about as simple as it gets. In this module, all we do is define a <code>router</code> object, with a single path. When a request is made to <code>/</code>, it will return <code>"OK"</code>. Don’t forget, the <em>actual</em> path is relative, and if you refer back to <code>main.py</code> you’ll see we set up this router to run under <code>/api/v1/public</code>.</p>
<h3 id="heading-routerssecurepy">routers/secure.py</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> APIRouter, Depends
<span class="hljs-keyword">from</span> auth <span class="hljs-keyword">import</span> get_user

router = APIRouter()

<span class="hljs-meta">@router.get("/")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_testroute</span>(<span class="hljs-params">user: dict = Depends(<span class="hljs-params">get_user</span>)</span>):</span>
    <span class="hljs-keyword">return</span> user
</code></pre>
<p>The secure routes are only slightly more complex. Just like in <code>main.py</code> we’re using FastAPI’s <code>Depends</code> to specify that we are dependent on a function, in this case: <code>get_user</code>. So we know this route is guaranteed secure, and can only be accessed by someone with a valid API key, because without one you won’t get through the <code>get_user</code> function without triggering the HTTP Not Authorized error we set up earlier. If you have a valid key, for now we just return the user object.</p>
<h2 id="heading-setup">Setup</h2>
<p>This may vary based on your own development preferences, but the quickest way I find to set up a FastAPI app is to create a <code>virtualenv</code>, install the FastAPI package, and then use <a target="_blank" href="https://www.uvicorn.org/">Uvicorn</a> to run the app. You may have noticed the <code>venv</code> directory earlier - I created this with:</p>
<pre><code class="lang-bash">python -mvenv venv
</code></pre>
<p>I then activated the virtual environment and installed FastAPI:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> venv/bin/active
pip install <span class="hljs-string">"fastapi[standard]"</span>
</code></pre>
<p>Finally, to run my FastAPI app, I can now run this command from the <code>app</code> directory:</p>
<pre><code class="lang-bash">uvicorn main:app --reload
</code></pre>
<p>The <code>--reload</code> option will reload my code automatically if I make any changes. If everything has worked for you, you should be able to access your app at <a target="_blank" href="http://127.0.0.1:8000/">http://127.0.0.1:8000/</a></p>
<h2 id="heading-testing">Testing</h2>
<p>First we sent a GET request with no API key to the public endpoint <code>/api/v1/public/</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551250791/33a3bc42-1181-4845-9324-e67d6ef025f4.png" alt="A GET request to /api/v1/public/ with no API key header returns an 'OK' response." class="image--center mx-auto" /></p>
<p>Now let’s try the secure endpoint at <code>/api/v1/secure/</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551252592/7f7f4521-7431-41a2-aacb-e91a9fda3601.png" alt="A GET request to /api/v1/secure/ with no API key header returns an HTTP 403 error." class="image--center mx-auto" /></p>
<p>The API returns a 403 error. Note, this isn’t the 401 error we coded for when an API key isn’t found in the database. FastAPI can’t even find an API key, because we haven’t specified one, so immediately returns the 403 Forbidden error.</p>
<p>Let’s try a random API key that is <em>not</em> in the database:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551254352/6e450214-e395-4316-ba89-b9f3bc2c3a0f.png" alt="A GET request to /api/v1/secure/ with an invalid API key header returns an HTTP 401 error." class="image--center mx-auto" /></p>
<p>This time, we’ve supplied an <code>X-API-Key</code> header, but the key we’re supplying isn’t in the database. So we return a 401 Unauthorized error (technically, we’re <em>unauthenticated</em> but these error codes were written a long time ago!)</p>
<p>Finally, let’s try a working API key, for the user Bob:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551256128/c0e392cf-efdd-4493-8033-c33d31d995b1.png" alt="A GET request to /api/v1/secure/ with an valid API key header successfully returns a username." class="image--center mx-auto" /></p>
<p>Success! The API key is found, checked and proven valid, and the user details are retrieved and used in the response.</p>
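<p>If you'd rather script these checks than click around in an API client, the same requests can be made from Python with the <code>requests</code> package (assuming you have it installed and the app is running locally):</p>
<pre><code class="lang-python">import requests

BASE = "http://127.0.0.1:8000/api/v1"

# The public route needs no key at all
print(requests.get(f"{BASE}/public/").status_code)  # 200

# No X-API-Key header: FastAPI rejects the request with a 403
print(requests.get(f"{BASE}/secure/").status_code)  # 403

# A key that isn't in our "database": our own 401 from auth.py
print(requests.get(f"{BASE}/secure/", headers={"X-API-Key": "not-a-real-key"}).status_code)  # 401

# Bob's key from db.py
print(requests.get(f"{BASE}/secure/", headers={"X-API-Key": "e54d4431-5dab-474e-b71a-0db1fcb9e659"}).json())  # {'name': 'Bob'}
</code></pre>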
<h2 id="heading-whats-next">What’s next?</h2>
<p>This is hopefully some helpful project boilerplate to get you started on your own FastAPI app. Don’t forget, we’ve hardcoded a dummy database in <code>db.py</code>, so connecting the authentication up to a real database should be the next thing we do. I’ll try to tackle this in the next post!</p>
<p>From there, we could continue to implement different routes and paths, which brings us onto the topics that come after authentication - authorisation and/or admission control.</p>
<p>Let me know what you think in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Fediverse (and Mastodon) first impressions]]></title><description><![CDATA[You probably can’t have failed to notice that Twitter is weird right now. There are some good writeups you can read in other places about the bad things that are going on (such as firing the entire accessibility team), and the further bad things that...]]></description><link>https://timberry.dev/fediverse-and-mastodon-first-impressions</link><guid isPermaLink="true">https://timberry.dev/fediverse-and-mastodon-first-impressions</guid><category><![CDATA[Fediverse]]></category><category><![CDATA[Mastodon]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Sat, 05 Nov 2022 12:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723630830035/65c4def6-3cb3-4667-b20d-a2b01b2fd13d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You probably can’t have failed to notice that Twitter is weird right now. There are some good writeups you can read in <a target="_blank" href="https://www.theguardian.com/technology/2022/nov/04/twitter-layoffs-misinformation-moderation">other</a> <a target="_blank" href="https://whatever.scalzi.com/2022/11/02/that-whole-twitter-thing-further-thoughts/">places</a> about the bad things that are going on (such as firing the entire accessibility team), and the further bad things that may happen as a result.</p>
<p>In the meantime, several people (about <a target="_blank" href="https://techcrunch.com/2022/11/03/decentralized-social-network-mastodon-grows-to-655k-users-in-wake-of-elon-musks-twitter-takeover/">230k</a> in the last few days) have decided to seek out an alternative community micro-blogging platform, such as Mastodon, which is part of the Fediverse. It’s always difficult to turn your back on a platform after so many years, and learn new terminology and new ways of doing things. As someone who has now been using these things for just a few short days, I’d like to try to help anyone else who wants to make the leap!</p>
<p>The first thing to realise is that Mastodon is not a replacement for a Twitter <em>experience</em>. It’s a different, and I think most people will agree, <em>better</em> experience. To explain how it all works, let’s look at some of the individual components.</p>
<h2 id="heading-the-fediverse">The Fediverse</h2>
<p>The Fediverse is the name given to a collection of connected servers that can share things with each other. The Fediverse gets its name from being a universe of federation. Federation in this context means lots of connected servers, that can run different types of platforms, and that are not owned or managed by a single entity (see how it’s different already?). These servers might share videos, or posts, or something else depending on the software they are running, but they mostly all share information with each other using an open standard called ActivityPub.</p>
<h2 id="heading-mastodon">Mastodon</h2>
<p>Mastodon is one of the most popular Fediverse platforms because it provides a very feature-rich micro-blogging social networking platform not entirely unlike Twitter (but again - better! I’ll explain more on that in a moment). In Mastodon:</p>
<ul>
<li><p>You don’t tweet, you <strong>post</strong></p>
</li>
<li><p>You don’t “like” a post, you <strong>favourite</strong> it</p>
</li>
<li><p>You don’t retweet, you <strong>boost</strong> a post</p>
</li>
</ul>
<p>From that basic point of view, it’s a similar experience to Twitter. Notably, the ability to do something akin to “quote tweeting” is disabled on most Mastodon sites. This is to try to limit the negative behaviour that, it turns out, quote tweeting has been promoting all this time.</p>
<h2 id="heading-instances">Instances</h2>
<p>Anyone with a bit of know-how and some server resources can run their own Mastodon instance. But that <em>doesn’t</em> mean that you’re joining a micro-blogging site with just a handful of people, because all Mastodon instances can share their posts and users with the rest of the Fediverse. You can follow anyone you like (unless they have opted to prevent people from following them), and it doesn’t matter which instance they are using, because all of the information is federated. People can even move their accounts from one instance to another if they want to.</p>
<p>So if everything is decentralised and federated, where do these individual server instances come into it?</p>
<h2 id="heading-the-experience">The Experience</h2>
<p>This is the key thing that helps you build your own experience in Mastodon and the Fediverse. You can choose an instance run by and populated by like-minded people; and that doesn’t have to mean people who have exactly the same hobbies as you, they might just share the same sensibilities. Your instance admins will work hard to maintain the server, and provide rules on what sort of behaviour they expect. This allows you to join a community with rules you respect. You can easily discover people local to your server or of course explore the rest of the Fediverse, but it’s likely your server admin may restrict the sharing of information with instances that they feel go against their own rules.</p>
<p>So your experience is now based on like-minded communities, and exploring and building connections to other interesting people, while at the same time being protected from some of the more extreme elements of the Internet. And of course there are no adverts or corporate interests, so you’re no longer being manipulated by algorithms to keep you addicted to doom-scrolling. For those of us who’ve been around since the dawn of the Internet, it’s like a welcome return to the more community-focused, decentralised places we used to visit before we all became beholden to social media giants and their need for ad clicks.</p>
<h2 id="heading-how-to-get-started">How to Get Started</h2>
<p>Even though Mastodon has been around for a while now, it’s still early days for its mass adoption, although I expect this to grow quickly with the exodus from Twitter. Right now though there aren’t that many user-friendly newbie guides for folks who aren’t already technically inclined.</p>
<p>The first thing you’ll need to do is choose an instance to join, and you should take your time to find the right base community for yourself. Don’t sweat it too much though, because you can move between instances in the future. Here’s some links to help get you started:</p>
<ul>
<li><p><a target="_blank" href="https://instances.social/">instances.social</a> is a curated list of currently available Mastodon instances, that also has a “wizard” to help you choose an instance.</p>
</li>
<li><p><a target="_blank" href="https://fedi.directory/">fedi.directory</a> is another curated list, this time of interesting people to follow.</p>
</li>
<li><p><a target="_blank" href="https://fedi.tips/">fedi.tips</a> contains lots of informal and unofficial guides for all aspects of using Mastodon. There’s a lot of information there, so it can be a bit overwhelming at first.</p>
</li>
</ul>
<p>Once you’ve found your community, just start to explore and take your time. When you feel settled, you should consider supporting your instance admin in whatever way they deem appropriate.</p>
<p>For me, Twitter was always a way to keep up with tech news (and sci-fi, books, RPGs and all things Trek). But it was always hard work to remove the doom, gloom, hate and deliberately triggering click-bait. I’ve only been using Mastodon for a week or so, but it’s already the complete opposite of that experience, and as more people make the migration, I don’t feel like I’m missing any tech news either! (or sci-fi, books, RPGs etc.)</p>
<p>I hope this post has helped you! Feel free to follow me at <a target="_blank" href="https://hachyderm.io/web/@timberry">@timberry@hachyderm.io</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Go containers for Cloud Run]]></title><description><![CDATA[As I learned by writing my Go tutorial series, Go is great for making lightweight web apps, and it also lends itself very easily to being packaged and run in a docker container. Go compiles to a static binary (most of the time), which means your runt...]]></description><link>https://timberry.dev/building-go-containers-for-cloud-run</link><guid isPermaLink="true">https://timberry.dev/building-go-containers-for-cloud-run</guid><category><![CDATA[golang]]></category><category><![CDATA[#cloudrun]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Sun, 02 Feb 2020 12:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551269023/d96327b8-19ab-418a-83c5-949e1ef4a966.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As I learned by writing my Go tutorial series, Go is great for making lightweight web apps, and it also lends itself very easily to being packaged and run in a docker container. Go compiles to a static binary (most of the time), which means your runtime container can be extremely small and efficient.</p>
<p>Once your app is containered up, deploying it to Google’s <a target="_blank" href="https://cloud.google.com/run/docs/">Cloud Run</a> is really easy and gives you out-of-the-box auto-scaling and a secure HTTPS front-end. To follow along below, make sure you have the <a target="_blank" href="https://cloud.google.com/sdk/docs/quickstarts/">Google Cloud SDK</a> set up on your local system.</p>
<p>First we’ll use a multi-stage docker build. In the directory of your Go web app, simply add the following <code>Dockerfile</code>:</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> golang:<span class="hljs-number">1.13</span> as build
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /go/src/app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>
<span class="hljs-keyword">RUN</span><span class="bash"> go build -v -o app .</span>

<span class="hljs-keyword">FROM</span> gcr.io/distroless/base
<span class="hljs-keyword">COPY</span><span class="bash"> --from=build /go/src/app/. /</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"/app"</span>]</span>
</code></pre>
<p>In the first stage of this file, we use the <code>golang:1.13</code> image to give us a build environment for our app. We copy everything from our local filesystem into the <code>/go/src/app</code> directory inside the container environment. Then we run <code>go build</code> to compile everything. Simple!</p>
<p>The next stage is the clever part. We start a new image from <code>gcr.io/distroless/base</code> and copy over just the files from our build stage (including our compiled runtime). In this example, we’re assuming there are supporting files to copy as well (for example, HTML and other static content), but we could refine this even more by just copying the binary application. Google’s <a target="_blank" href="https://github.com/GoogleContainerTools/distroless">distroless project</a> contains just enough Linux to run our compiled binary. There’s no package manager, no shell, so it makes for a very efficient image.</p>
<p>You can build this image locally with Docker and push it to Google <a target="_blank" href="https://cloud.google.com/container-registry/docs/">Container Registry</a>, or just use Google’s <a target="_blank" href="https://cloud.google.com/cloud-build/docs/">Cloud Build</a> to do the work for you:</p>
<pre><code class="lang-bash">gcloud builds submit --tag gcr.io//&lt;your-project-id&gt;/&lt;your-image-name&gt; .
</code></pre>
<p>Now you can deploy your app with just one more command:</p>
<pre><code class="lang-bash">gcloud run deploy &lt;deployment-name&gt; gcr.io/&lt;your-project-id&gt;/&lt;your-image-name&gt;
</code></pre>
<p><em>(Don’t forget to replace &lt;your-project-id&gt;, &lt;your-image-name&gt; and &lt;deployment-name&gt; in these examples)</em></p>
<p>In the <a target="_blank" href="https://console.cloud.google.com/run">Cloud Run console</a> you should now see your new deployed service, complete with its URL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551267242/1e348359-0f14-4fee-92d0-cb681ebe38ca.png" alt="Gopher in a box" class="image--center mx-auto" /></p>
<p>Cloud Run offers some great features like traffic <a target="_blank" href="https://cloud.google.com/run/docs/rollouts-rollbacks-traffic-migration">splitting and migration</a> and <a target="_blank" href="https://cloud.google.com/run/docs/mapping-custom-domains">custom domains</a>. Deploying container based applications has never been easier! Except when things go wrong…</p>
<h2 id="heading-missing-libraries">Missing libraries</h2>
<p>You may have successfully tested your Docker image locally, but it’s failing when you deploy it. If you’re really unlucky, you may receive this error message in your docker logs:</p>
<pre><code class="lang-plaintext">exec user process caused "no such file or directory"
</code></pre>
<p>This error is almost completely useless, and will probably send you down the path of debugging the contents of your image looking for missing files or directories. What’s actually at fault is that there are <em>system libraries</em> missing that Go needs, because they are not part of your base image. This happens when Go’s statically compiled binaries aren’t quite as static as we’d like them to be.</p>
<p>To fix this, we need to send some extra parameters to the <code>go build</code> command so that it also compiles in any libraries it needs:</p>
<pre><code class="lang-bash">CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o app .
</code></pre>
<p>This should fix the problem most of the time.</p>
<h2 id="heading-but-i-need-cgo">But I need CGO!</h2>
<p>The above fix works great as long as switching off <a target="_blank" href="https://golang.org/cmd/cgo/">CGO</a> isn’t a problem. But some Go libraries require CGO (for example, the rather handy <code>go-sqlite3</code>). So to create a completely static binary while still allowing CGO, we have to update our build line again.</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">RUN</span><span class="bash"> go get -d -v ./...</span>
<span class="hljs-keyword">RUN</span><span class="bash"> CGO_ENABLED=1 GOOS=linux go build -a -ldflags <span class="hljs-string">'-linkmode external -extldflags "-static"'</span> -o app .</span>
</code></pre>
<p>First we run <code>go get</code> to include any external dependencies. Then we send extra parameters to <code>go build</code> to make sure it <em>really</em> includes everything. This results in a much longer build process, but hopefully a static binary that works.</p>
<p>Hooray for write once, run <em>almost</em> everywhere! 😊</p>
]]></content:encoded></item><item><title><![CDATA[Learning to Go: Part 7 - Adding Images]]></title><description><![CDATA[Oh hi! This is a multi-part tutorial series, please make sure you read the posts in order! You can find the list at the bottom of the page.
We’ve come a long way, from a simple Hello World to a fully functional dialog search engine for Star Trek: Th...]]></description><link>https://timberry.dev/learning-to-go-part-7-adding-images</link><guid isPermaLink="true">https://timberry.dev/learning-to-go-part-7-adding-images</guid><category><![CDATA[golang]]></category><dc:creator><![CDATA[Tim Berry]]></dc:creator><pubDate>Tue, 28 Jan 2020 12:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723629155186/53c989fc-496f-4490-a208-2b55ae796237.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Oh hi! This is a multi-part tutorial series, please make sure you read the posts in order! You can find the list at the bottom of the page.</em></p>
<p>We’ve come a long way, from a simple <em>Hello World</em> to a fully functional dialog search engine for Star Trek: The Next Generation. This is quite an achievement! But how can we improve on it further? Wouldn’t it be nice to see some pictures with our search results? Picard’s shiny head, or Riker’s magical beard for example.</p>
<h2 id="heading-another-api">Another API</h2>
<p>Unfortunately, the API we’re querying doesn’t contain any image data or URLs. However, it <em>does</em> contain the IMDB ID of the episode, which we can use to query a further API for some image data. To do this, we’re going to use <a target="_blank" href="https://www.themoviedb.org/">The Movie Database (TMDb)</a>, a community driven movie database with a fantastic public API.</p>
<p>To use this API, you will need to sign up for a free TMDb developer account and create your own API key. <a target="_blank" href="https://developers.themoviedb.org/3/getting-started/introduction">Full instructions can be found here</a>. Make a note of the key - we’ll use it in a moment.</p>
<h2 id="heading-more-structs">More structs</h2>
<p>We’re going to be passing multiple sources of data to our template, so we’ll need to change the way we’re using our structs. First, we’ll change the <code>Dialog []struct</code>. Rather than keep this struct as a slice, we’ll make it a singular struct (you’ll see why in a moment). To do this, just change its first line to:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Dialog <span class="hljs-keyword">struct</span> {
</code></pre>
<p>Now we’ll create a new type of struct, which will contain all of the results we want to send to our template. Enter this after the complete definition for the <code>Dialog</code> struct:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Results <span class="hljs-keyword">struct</span> {
    SearchKey <span class="hljs-keyword">string</span>
    Lines     []Dialog
    Images    <span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">string</span>
}
</code></pre>
<p>As you can see, we’re now creating a field called <code>Lines</code>, which will be the slice of <code>Dialog</code>s. Inside this new struct, we also store our original <code>SearchKey</code>, and a new field called <code>Images</code>. This is a <em>map</em>, a series of key+value pairs. You may be familiar with maps from Java, or dictionaries from Python, which are essentially the same thing.</p>
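<p>If maps are new to you, here’s a tiny standalone sketch (separate from our app, with placeholder keys and values) showing how a map is created, written to and read from, including the “comma ok” lookup we’ll lean on later:</p>
<pre><code class="lang-go">package main

import "fmt"

func main() {
    // Create an empty map with string keys and string values.
    images := make(map[string]string)

    // Store a value against a key.
    images["tt0708807"] = "darmok-still.jpg"

    // Look up a key; ok is false if the key isn't present.
    still, ok := images["tt0708807"]
    fmt.Println(still, ok) // darmok-still.jpg true

    _, ok = images["tt9999999"]
    fmt.Println(ok) // false
}
</code></pre>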
<p>Next we need to create the struct for the TMDb API. Enter the following:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> TMDBQuery <span class="hljs-keyword">struct</span> {
    MovieResults     []<span class="hljs-keyword">string</span> <span class="hljs-string">`json:"-"`</span>
    PersonResults    []<span class="hljs-keyword">string</span> <span class="hljs-string">`json:"-"`</span>
    TvResults        []<span class="hljs-keyword">string</span> <span class="hljs-string">`json:"-"`</span>
    TvEpisodeResults []<span class="hljs-keyword">struct</span> {
        AirDate        <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"air_date"`</span>
        EpisodeNumber  <span class="hljs-keyword">int</span>     <span class="hljs-string">`json:"episode_number"`</span>
        ID             <span class="hljs-keyword">int</span>     <span class="hljs-string">`json:"id"`</span>
        Name           <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"name"`</span>
        Overview       <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"overview"`</span>
        ProductionCode <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"production_code"`</span>
        SeasonNumber   <span class="hljs-keyword">int</span>     <span class="hljs-string">`json:"season_number"`</span>
        ShowID         <span class="hljs-keyword">int</span>     <span class="hljs-string">`json:"show_id"`</span>
        StillPath      <span class="hljs-keyword">string</span>  <span class="hljs-string">`json:"still_path"`</span>
        VoteAverage    <span class="hljs-keyword">float64</span> <span class="hljs-string">`json:"vote_average"`</span>
        VoteCount      <span class="hljs-keyword">int</span>     <span class="hljs-string">`json:"vote_count"`</span>
    } <span class="hljs-string">`json:"tv_episode_results"`</span>
    TvSeasonResults <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"-"`</span>
}
</code></pre>
<p>There’s quite a lot here but it’s easy to break down. First of all - ignore <code>MovieResults</code>, <code>PersonResults</code> and <code>TvSeasonResults</code>. The API we’re querying could potentially return results in these 3 keys, but we don’t want to use them - we know the IDs we are sending in our query will only return <code>TvEpisodeResults</code>. That’s why we are tagging them with <code>json:"-"</code>. This tells the JSON Decoder to completely ignore them, and anything they might contain. Inside <code>TvEpisodeResults</code> we can see all the data we will scrape from this API, the most important part being <code>StillPath</code>, which will lead us to a public URL for a screenshot from the episode.</p>
<p><code>TvEpisodeResults</code> is also a slice, as denoted by <code>[]struct</code>, because the API could potentially return multiple results in this field. However, as we’re sending the IMDB ID in our query, we can be sure that this slice will only contain a single entry.</p>
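<p>If you’d like to see the <code>json:"-"</code> tag in action on its own, here’s a small sketch you can run separately. It uses made-up JSON and a trimmed-down version of the struct above, so the field names and values are just for illustration:</p>
<pre><code class="lang-go">package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// demoQuery is a cut-down stand-in for TMDBQuery: fields tagged with
// json:"-" are never filled in by the decoder.
type demoQuery struct {
    MovieResults     []string `json:"-"`
    TvEpisodeResults []struct {
        Name      string `json:"name"`
        StillPath string `json:"still_path"`
    } `json:"tv_episode_results"`
}

func main() {
    body := `{"movie_results":["ignored"],"tv_episode_results":[{"name":"Darmok","still_path":"/abc.jpg"}]}`

    var q demoQuery
    if err := json.NewDecoder(strings.NewReader(body)).Decode(&amp;q); err != nil {
        fmt.Println("decode error:", err)
        return
    }

    fmt.Println(q.MovieResults)                  // [] - the decoder skipped it
    fmt.Println(q.TvEpisodeResults[0].StillPath) // /abc.jpg
}
</code></pre>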
<h2 id="heading-using-the-api-key">Using the API key</h2>
<p>Remember that API key you got from TMDb? Store it in a string variable now. Only joking! Of course, hard-coding API keys and other sensitive information inside your program is a terrible idea. Instead, just above the <code>var tpl</code> definition, add this line:</p>
<pre><code class="lang-go"><span class="hljs-keyword">var</span> API_KEY <span class="hljs-keyword">string</span>
</code></pre>
<p>We’re declaring the variable, but we haven’t stored anything in it yet. Jump down into your <code>main()</code> function and add this at the start of it:</p>
<pre><code class="lang-go">API_KEY = os.Getenv(<span class="hljs-string">"API_KEY"</span>)
<span class="hljs-keyword">if</span> API_KEY == <span class="hljs-string">""</span> {
    fmt.Println(<span class="hljs-string">"No API_KEY in environment"</span>)
    os.Exit(<span class="hljs-number">1</span>)
}
</code></pre>
<p><code>os.Getenv</code> grabs the API_KEY from the runtime environment. Then we check to make sure the variable isn’t empty - if it is, we quit with an error message.</p>
<p>On Linux and Mac systems, you can set an environment variable like this:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> API_KEY=EJ5RkgowmPfwSIra9EDqelQirMOSoyd6
</code></pre>
<p>(Obviously, replace the key with the one you’ve obtained. That’s not a real key!) I’ve never tried this with Windows, but I <a target="_blank" href="http://www.dowdandassociates.com/blog/content/howto-set-an-environment-variable-in-windows-command-line-and-registry/">googled this for you</a> :)</p>
<h2 id="heading-updating-the-search-handler">Updating the search handler</h2>
<p>Next, we’ll make several changes to our <code>searchHandler</code> function. To start with, replace:</p>
<pre><code class="lang-go">dialog = &amp;Dialog{}
</code></pre>
<p>with:</p>
<pre><code class="lang-go">results := &amp;Results{}
results.Images = <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">string</span>)
results.SearchKey = searchKey
</code></pre>
<p>We’re now creating an instance of our new <code>Results</code> struct, which can store the search key, lines of dialog, and a map of images. We call <code>make</code> to set up our empty map, specifying that the keys and values will both be strings. Then we store the <code>searchKey</code> we retrieved from the HTTP request as <code>results.SearchKey</code>.</p>
<p>The next few parts of our function stay the same. We still contact the original API and check for errors or return codes that are not OK. But we’ll change the call to <code>json.NewDecoder</code> to:</p>
<pre><code class="lang-go">err = json.NewDecoder(resp.Body).Decode(&amp;results.Lines)
</code></pre>
<p>We’re now storing the results in <code>Lines</code>, a field of our <code>results</code> object. Leave the next piece of error checking in place. The next line should be <code>spew.Dump(dialog)</code>. We’ll delete that because we no longer need this level of debug, and we no longer have a variable called <code>dialog</code>!</p>
<p>Now comes a rather large chunk of code to enter:</p>
<pre><code class="lang-go"><span class="hljs-keyword">for</span> _, d := <span class="hljs-keyword">range</span> results.Lines {
    tmdbquery := &amp;TMDBQuery{}
    imdbid := d.Imdbid
    _, ok := results.Images[imdbid]
    <span class="hljs-keyword">if</span> !ok {
        tmdbep := fmt.Sprintf(<span class="hljs-string">"https://api.themoviedb.org/3/find/%s?api_key=%s&amp;language=en-US&amp;external_source=imdb_id"</span>, imdbid, API_KEY)
        tmdbresp, err := http.Get(tmdbep)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            w.WriteHeader(http.StatusInternalServerError)
            <span class="hljs-keyword">return</span>
        }
        <span class="hljs-keyword">defer</span> tmdbresp.Body.Close()
        <span class="hljs-keyword">if</span> tmdbresp.StatusCode != <span class="hljs-number">200</span> {
            w.WriteHeader(http.StatusInternalServerError)
            <span class="hljs-keyword">return</span>
        }

        err = json.NewDecoder(tmdbresp.Body).Decode(&amp;tmdbquery)

        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            w.WriteHeader(http.StatusInternalServerError)
            <span class="hljs-keyword">return</span>
        }
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(tmdbquery.TvEpisodeResults) &gt; <span class="hljs-number">0</span> {
            results.Images[imdbid] = tmdbquery.TvEpisodeResults[<span class="hljs-number">0</span>].StillPath
        } <span class="hljs-keyword">else</span> {
            results.Images[imdbid] = <span class="hljs-string">"8do6gZErem4wfdPwALiT8agtJfb.jpg"</span>
        }
    }
}
</code></pre>
<p>Phew! Let’s go through some of that to clarify what it’s doing, although none of it is particularly complicated. This is our first <em>for</em> loop in Go. This is actually the only looping construct in Go - if you’re used to <em>whiles</em> or <em>untils</em> you’ll have to get used to doing everything with <em>for</em>.</p>
<p>Using <code>range results.Lines</code> will iterate through all the objects in the <code>Lines</code> slice, providing us with 2 variables at a time: the index of the slice, and the value itself. We don’t need the index, so we use the dummy variable <code>_</code>, and we store the value as <code>d</code>.</p>
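<p>If Go’s <code>for</code> loop is new to you, this little sketch (separate from our app) shows the while-style form and the <code>range</code> form side by side:</p>
<pre><code class="lang-go">package main

import "fmt"

func main() {
    // A "while" loop in Go is just a for loop with a single condition.
    n := 0
    for n &lt; 3 {
        n++
    }
    fmt.Println(n) // 3

    lines := []string{"Engage!", "Make it so.", "Tea, Earl Grey, hot."}

    // range gives us two values per iteration: the index and the element.
    for i, line := range lines {
        fmt.Println(i, line)
    }

    // When we only need the element, we discard the index with _.
    for _, line := range lines {
        fmt.Println(line)
    }
}
</code></pre>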
<p>Next, we create an instance of <code>TMDBQuery</code> to store the results we’re about to get, and we grab the IMDB ID of our current result and store it as <code>imdbid</code>. In a moment we’ll start storing image URLs in the map we created earlier.</p>
<p>The next line looks a little odd: <code>_, ok := results.Images[imdbid]</code>, but all this does is check to see if we already have an image in our map for this IMDB ID. If we do, we skip the next section entirely. This saves time and API calls. For example, if we queried the dialog “Darmok”, we’ll get a few dozen results, but they’ll all have the same IMDB ID (they’re all from the same episode!). So we only need to get the image URL once.</p>
<p>Inside the next block, we’re only proceeding if there’s no match in the map (<code>if !ok</code>). We prep the API URL as <code>tmdbep</code> using the IMDB ID and our API key. The next few chunks of code should look very familiar. In the same way as we queried the original API, we send a request, do some error checking on the response, then use the JSON Decoder to store the results in a struct.</p>
<p>Finally, we check to see how many results are in <code>TvEpisodeResults</code>. If there’s at least one result, we know we have a match (and only one match). So we can grab that first index and its image URL: <code>tmdbquery.TvEpisodeResults[0].StillPath</code>.</p>
<p>If for some reason there was no match, we’ve failed to find an image for this episode. So instead, we fall back to a hard-coded URL, which is a lovely full cast shot of our starship crew.</p>
<p>One last line of code to change! We need to send the new struct to our template. So replace:</p>
<pre><code class="lang-go">err = tpl.Execute(w, dialog)
</code></pre>
<p>with:</p>
<pre><code class="lang-go">err = tpl.Execute(w, results)
</code></pre>
<h2 id="heading-updating-the-template">Updating the template</h2>
<p>We’re so close! Because we’ve changed the struct that’s being passed to our template, we need to update the template itself. First, add the following to the end of <code>style.css</code>:</p>
<pre><code class="lang-css"><span class="hljs-selector-class">.episode-image</span> {
  <span class="hljs-attribute">width</span>: <span class="hljs-number">200px</span>;
  <span class="hljs-attribute">flex-grow</span>: <span class="hljs-number">0</span>;
  <span class="hljs-attribute">flex-shrink</span>: <span class="hljs-number">0</span>;
  <span class="hljs-attribute">margin-left</span>: <span class="hljs-number">20px</span>;
}
</code></pre>
<p>Now update the <code>&lt;section class="container"&gt;</code> part of the <code>index.html</code> file:</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">section</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"container"</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">ul</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"search-results"</span>&gt;</span>
    {{ range .Lines }}
    <span class="hljs-tag">&lt;<span class="hljs-name">li</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"dialog"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">h3</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"title"</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">a</span> <span class="hljs-attr">href</span>=<span class="hljs-string">"https://www.imdb.com/title/{{ .Imdbid }}/"</span>&gt;</span>{{ .Episodename }}<span class="hljs-tag">&lt;/<span class="hljs-name">a</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
        {{ if eq .Texttype "speech" }}
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"description"</span>&gt;</span>{{ .Who }}: <span class="hljs-tag">&lt;<span class="hljs-name">i</span>&gt;</span>"{{ .Text }}"<span class="hljs-tag">&lt;/<span class="hljs-name">i</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
        {{ else }}
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"description"</span>&gt;</span>{{ .Text }}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
        {{ end }}            
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>Act {{ .Act }} Scene {{ .Scenenumber }}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>Season {{ .Season }} Episode {{ .Episode }}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">img</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"episode-image"</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"https://image.tmdb.org/t/p/w454_and_h254_bestv2/{{ index $.Images .Imdbid }}"</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">li</span>&gt;</span>
    {{ end }}
  <span class="hljs-tag">&lt;/<span class="hljs-name">ul</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">section</span>&gt;</span>
</code></pre>
<p>The changes here are quite self-explanatory. Rather than range through a single object (as <code>.</code>) we use the <code>.Lines</code> field of the struct. To make things look a bit nicer, we’ve also removed the word “Episode” and again used the <code>Imdbid</code> to link to the actual IMDB page for an episode.</p>
<p>Then we add an image tag, using the start of the TMDB media URL, but appending a string we grab from the <code>Images</code> map using <code>Imdbid</code> as a key. Pretty neat, huh?</p>
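<p>If the <code>index $.Images .Imdbid</code> part looks mysterious, here’s a minimal sketch (with made-up data, not our real structs) of how the template’s <code>index</code> built-in and the <code>$</code> variable work together inside a <code>range</code>:</p>
<pre><code class="lang-go">package main

import (
    "html/template"
    "os"
)

func main() {
    // Made-up data shaped like the fields our template uses: a slice of
    // results and a map of image paths keyed by IMDB ID.
    data := struct {
        Lines  []struct{ Imdbid string }
        Images map[string]string
    }{
        Lines:  []struct{ Imdbid string }{{Imdbid: "tt0708807"}},
        Images: map[string]string{"tt0708807": "example-still.jpg"},
    }

    // Inside range, "." is the current element, "$" is always the top-level
    // data, and index looks a value up in a map by key.
    tpl := template.Must(template.New("demo").Parse(
        "{{ range .Lines }}{{ index $.Images .Imdbid }}\n{{ end }}"))

    if err := tpl.Execute(os.Stdout, data); err != nil {
        panic(err)
    }
}
</code></pre>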
<p>Now we’re finally done! Save everything and restart your Go server. I’m sure you know how by now :)</p>
<h2 id="heading-results">Results!</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723551302543/23ba0184-81c7-4ca8-ad10-3fa680f325f7.png" alt="Search results" class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>We’ve successfully built a reasonably complex web application in Go, that queries 2 different HTTP APIs and performs some pretty advanced template rendering.</p>
<p>Does this mean that we’re now at least basically competent in Go? For me, the answer is No. There are still plenty of concepts I’ve yet to grasp. But the difference for me is that now I’ve actually <em>built</em> something.</p>
<p>Before, I would look at a Go program and shudder. I would attempt the <a target="_blank" href="https://tour.golang.org/welcome/1">Tour of Go</a> or <a target="_blank" href="https://gobyexample.com/">Go by Example</a> and just fall asleep mystified, as I had no context to apply to either of those rather verbose tutorials. Now I’ve seen a program work in the wild, and it all makes just a little bit more sense. It’s a solid place to start the continuing voyage of learning to Go!</p>
<p>For some next steps, I’d encourage you to try a couple of things on your own:</p>
<ul>
<li><p>Can you add a count to the number of results?</p>
</li>
<li><p>Can you pre-populate the search box with the search key when you show results? Don’t forget it’s already in the struct being passed to the template.</p>
</li>
</ul>
<p>If you’d like to try another step-by-step Go tutorial before jumping straight back into the comprehensive stuff, I’d also highly recommend Daniela Petruzalek’s <a target="_blank" href="https://github.com/danicat/pacgo">Pac Go</a> (a Pac Man clone written in Go). Dani was a huge inspiration for me in writing this series!</p>
<p>I really hope you’ve enjoyed following along. Let me know your thoughts, and share your own Go learning experience :)</p>
<p>If you need to check your code, here are the full gists: <a target="_blank" href="https://gist.github.com/timhberry/fd984770997ed76630600b42174c8500">main.go</a>, <a target="_blank" href="https://gist.github.com/timhberry/c4cf25a147dc755791b2825162f43895">index.html</a> and <a target="_blank" href="https://gist.github.com/timhberry/9d637ddaf583f474519ef1feff5a6e49">style.css</a>.</p>
]]></content:encoded></item></channel></rss>