Just before DevFest last year, I spoke at our local GDG meetup about the perils of persistent high availability storage in Kubernetes. In the talk we summarised the usual approaches in GKE including zonal and regional persistent disks, and how they operate with stateful and stateless deployments.
The bold elevator pitch of the talk was that storage should be run as its own service, to be consumed by your deployments, rather than built as part of them using native components. The thing is, talks these days need to make a bold statement, even if you don’t believe in it 100% :-)
There are a few different storage services available, but we quickly dismissed them:
- Cloud Storage - perfect for object storage, but not particularly performant and will usually require application refactoring (devs take note: don’t always expect access to a POSIX filesystem!)
- Cloud Filestore - an expensive but reliable and performant managed NFS solution. However, sadly it is still a zonal resource (please fix this Google!)
- Roll-your-own in Compute - the last resort of a desperate sysadmin. Get lost for days in buggy Puppet modules trying to string a GlusterFS cluster together. Then just pray you never need to change it.
So is there a better solution?
Ceph is an open-source project that provides massively scalable, software-defined storage systems on commodity hardware. It can provide object, block or file system storage, and automatically distributes and replicates data across multiple storage nodes to guarantee no single point of failure. It’s used by CERN to store petabytes of data generated by the Large Hadron Collider, so it’s probably good enough for us!
But what has this got to do with Kubernetes? Do we have to learn to build and configure yet another system?
Introducing Rook, an operator and orchestrator for Kubernetes that automates the provisioning, configuration, scaling, migration and disaster recovery of storage. Rook supports several backend providers and uses a consistent common framework across all of them. The Ceph provider for Rook is stable and production ready.
In a nutshell: Ceph is a massive resilient storage service, and Rook automates it for Kubernetes.
Now to the point of this post: let’s run Rook & Ceph on Kubernetes and see for ourselves how awesome it is! To follow along you’ll need a GCP project. We’ll create some billable resources for the duration of the tutorial, but you can delete them all as soon as we’re done. Make sure you have gcloud set up locally (instructions here), which should include kubectl, or to save time you can just use Cloud Shell.
Create a GKE Cluster
Let’s start by creating a regional GKE cluster with the gcloud command. Note that we’ll run Ubuntu on our nodes, not COS, to give us a bit more flexibility. I’m creating my cluster in us-central1 but feel free to choose another region.

```shell
gcloud container clusters create "demo-cluster" --region "us-central1" \
  --no-enable-basic-auth --machine-type "n1-standard-2" --num-nodes "1" \
  --image-type "UBUNTU" --disk-size "100" --disk-type "pd-standard" \
  --no-issue-client-certificate --enable-autoupgrade --enable-ip-alias \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --cluster-version "1.13.11-gke.15" --enable-autorepair \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append"
```
It will take a few minutes to create your cluster, after which you can set up your local kubectl credentials:

```shell
gcloud container clusters get-credentials demo-cluster --region=us-central1
```
Deploy the Rook Operator
Next we’ll deploy Rook’s Custom Resource Definitions (CRDs) and Operator to our cluster. Thankfully, all the manifests required for this are provided in Rook’s git repo. We can find what we need in the Ceph examples directory:

```shell
git clone https://github.com/rook/rook
cd rook/cluster/examples/kubernetes/ceph
```
Before we actually deploy these manifests, we need to make a quick change to the operator.yaml file. Open it up and locate the following lines (somewhere around line 79):

```yaml
# - name: FLEXVOLUME_DIR_PATH
#   value: "<PathToFlexVolumes>"
```
Change them to:

```yaml
- name: FLEXVOLUME_DIR_PATH
  value: "/home/kubernetes/flexvolume"
```

(make sure the - is vertically aligned with where the # was above it)
This allows Rook to use a local volume. Now we can create the CRDs and Operator:
```shell
kubectl create -f common.yaml
kubectl create -f operator.yaml
```
It will take a while for all the pods involved in Rook’s operator to be happy. You can keep an eye on things with:
```shell
watch kubectl -n rook-ceph get pods
```
When you have a rook-ceph-operator pod and 3 rook-discover pods, you’re ready to move on and you can CTRL-C out of watching the output. At this point you might want to move out of the Rook git repo directory.
Create the Ceph cluster and storage class
Next we’ll create the Ceph cluster. This is basically our cluster of Ceph agents that provide the storage service. Create the following manifest as ceph-cluster.yaml. The cluster manifest defines things like the version of Ceph we’re using, and where Ceph can store its data. We specify a dataDirHostPath to tell it to use the local disk of the Kubernetes node it runs on, and on each node it will create a 10GB block.
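The original embedded manifest isn’t reproduced here, so as a minimal sketch based on Rook’s published v1.x examples (the Ceph image tag and mon count are my assumptions, not values from the talk):

```yaml
# ceph-cluster.yaml — sketch of a CephCluster using node-local directories
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v14.2.4   # assumed tag; use the release your Rook version documents
  dataDirHostPath: /var/lib/rook   # store Ceph data on the node's local disk
  mon:
    count: 3
    allowMultiplePerNode: false
  storage:
    useAllNodes: true
    useAllDevices: false
    directories:
      - path: /var/lib/rook
```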
Next we create a CephBlockPool, in ceph-blockpool.yaml. A CephBlockPool is how we create a replicated block storage service. As I mentioned before, block storage is just one type of storage that Ceph can provide, but this is what we need if we want to create virtual disks (or persistent volumes).
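A sketch of such a pool, again following Rook’s stock examples (the pool name and replica count are assumptions):

```yaml
# ceph-blockpool.yaml — sketch of a replicated block pool
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host   # spread replicas across different nodes
  replicated:
    size: 3             # keep 3 copies of every object
```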
Finally we create a new storage class for Kubernetes that uses our block pool, in ceph-storageclass.yaml.
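A sketch of the storage class, assuming the flexvolume-based Rook provisioner (which matches the FLEXVOLUME_DIR_PATH change we made earlier) and the hypothetical pool name replicapool:

```yaml
# ceph-storageclass.yaml — sketch of a storage class backed by the block pool
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: ceph.rook.io/block
parameters:
  blockPool: replicapool       # the CephBlockPool to carve volumes from
  clusterNamespace: rook-ceph
  fstype: xfs                  # filesystem created on each provisioned volume
reclaimPolicy: Delete
```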
Note that the storage class references the block pool we just created. The provisioner will also helpfully create an xfs filesystem for us on any disks it creates. Now we’ve written these files, let’s create the objects, starting with the cluster:
```shell
kubectl create -f ceph-cluster.yaml
```
The Rook operator does a lot of work at this point to configure your Ceph cluster. You can use watch kubectl -n rook-ceph get pods again to keep an eye on things. Wait about 3 or 4 minutes until everything is running happily, with no further pods initialising or creating. Then you can create the remaining objects:
```shell
kubectl create -f ceph-blockpool.yaml
kubectl create -f ceph-storageclass.yaml
```
Once everything is up and running, we can consume the storage class to dynamically provision and claim persistent volumes backed by Ceph.
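To illustrate, any claim that names the storage class will now be provisioned dynamically by Rook. A minimal, hypothetical example (the claim name is made up):

```yaml
# A hypothetical PVC that Rook will satisfy with a Ceph-backed volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block   # the class we just created
  resources:
    requests:
      storage: 1Gi
```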
Deploy a test application: Cassandra
Let’s deploy Cassandra, so we actually have a useful application we can use to test our Ceph & Rook storage service. Apache Cassandra is an open-source, distributed, wide-column store NoSQL database. Like Ceph, it’s also designed to replicate across multiple commodity servers. To set it up for Kubernetes, first we create a Service in cassandra-service.yaml.
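The Service manifest wasn’t preserved here; a sketch along the lines of the standard Kubernetes Cassandra example (headless, which is an assumption on my part) would look like:

```yaml
# cassandra-service.yaml — sketch of a headless service for the StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  clusterIP: None   # headless: gives each pod a stable DNS name
  ports:
    - port: 9042
      name: cql
  selector:
    app: cassandra
```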
Then we’ll create a StatefulSet for Cassandra. Stateful sets in Kubernetes are different to stateless deployments, in that they expect their associated volumes to persist. If something happens to a pod and it has to be recreated, it expects all of the previous data to still exist. So this is a great test for Ceph. Create the following manifest as cassandra-statefulset.yaml.
This is a pretty big manifest, but don’t worry too much about the details! We set up a bunch of ports and environment variables necessary for Cassandra to work, as well as configure a readinessProbe that should allow it to gracefully handle any disruption to the cluster. Note the volumeClaimTemplate at the end, where we ask for a 1GB volume using the rook-ceph-block storage class we created earlier.
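The full manifest isn’t reproduced here; a trimmed sketch, following the well-known Kubernetes Cassandra sample (the image tag, heap sizes and probe script are assumptions from that sample, not from the original post):

```yaml
# cassandra-statefulset.yaml — sketch of a 3-replica Cassandra StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
        - name: cassandra
          image: gcr.io/google-samples/cassandra:v13   # assumed sample image
          ports:
            - containerPort: 7000   # intra-node
            - containerPort: 7001   # TLS intra-node
            - containerPort: 7199   # JMX
            - containerPort: 9042   # CQL
          env:
            - name: MAX_HEAP_SIZE
              value: 512M
            - name: HEAP_NEWSIZE
              value: 100M
            - name: CASSANDRA_SEEDS
              value: cassandra-0.cassandra.default.svc.cluster.local
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          readinessProbe:
            exec:
              command: ["/bin/bash", "-c", "/ready-probe.sh"]
            initialDelaySeconds: 15
            timeoutSeconds: 5
          volumeMounts:
            - name: cassandra-data
              mountPath: /cassandra_data
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: rook-ceph-block   # our Ceph-backed class
        resources:
          requests:
            storage: 1Gi
```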
Now create the Cassandra application:
```shell
kubectl create -f cassandra-service.yaml
kubectl create -f cassandra-statefulset.yaml
```
You can watch the Cassandra pods come to life (we’re using the wide option here so we can take note of our GKE node names):

```shell
watch kubectl get pods -o wide
```
After a while, your Cassandra stateful set should contain 3 happy pods:
```
NAME          READY   STATUS    RESTARTS   AGE     IP           NODE
cassandra-0   1/1     Running   0          4m26s   10.28.1.17   gke-demo-cluster-default-pool-3983d5e5-7q2w
cassandra-1   1/1     Running   0          3m34s   10.28.2.13   gke-demo-cluster-default-pool-9c04b3ed-0jrr
cassandra-2   1/1     Running   0          116s    10.28.0.10   gke-demo-cluster-default-pool-f202df4d-ck5d
```
Make a note of the name of the node running the cassandra-1 pod, as we’ll need it shortly.
Simulate a zone failure
To prove how useful Rook & Ceph are, we’ll simulate a zone failure. GKE will be forced to re-create a pod, but as it is part of a stateful set it will expect to still retain access to its data.
Before we do that, let’s check Cassandra is happy and healthy. We can do this by running the nodetool command on one of our pods:

```shell
kubectl exec -it cassandra-0 -- nodetool status
```
You should see some output that shows Cassandra is happy with the 3 nodes in its cluster:
```
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.28.2.13  103.81 KiB  32      60.5%             41d5e80c-f2af-42bb-8bb2-f05d86819cf2  Rack1-K8Demo
UN  10.28.0.10  89.9 KiB    32      72.0%             0ce6b369-212b-4dd5-a711-6a2fd2259a3b  Rack1-K8Demo
UN  10.28.1.17  104.55 KiB  32      67.4%             8a690630-407d-4b18-9342-741349155e59  Rack1-K8Demo
```
Next, we’ll cordon the node running the cassandra-1 pod (replace with your node name from the output you got previously). This prevents Kubernetes from scheduling any more work on this node.

```shell
kubectl cordon gke-demo-cluster-default-pool-3983d5e5-7q2w
```
Now we’ll delete the cassandra-1 pod. Before you go ahead and type the next command, think about what we’re doing:
- We’re deleting a pod in a StatefulSet. The Kubernetes scheduler will try to recreate the missing pod in the same part of the set. In other words, it will try to create a new cassandra-1 pod
- As it’s a StatefulSet, it assumes there will be a stateful volume it can attach to, where its data should already be persisted
- On a single-zone cluster this normally wouldn’t be a problem, but each node of our cluster is in a different zone, and volumes backed by persistent disks are zonal resources
- It can’t recreate cassandra-1 on the same node because we cordoned it!
Run this command and watch what magically happens:
```shell
kubectl delete pod cassandra-1
watch kubectl get pods -o wide
```
Kubernetes has to schedule the new cassandra-1 pod on a different node (as confirmed by the wide output), but by some sorcery we don’t have an affinity conflict, and the pod comes up no problem!
You can confirm Cassandra is happy again with:
```shell
kubectl exec -it cassandra-0 -- nodetool status
```
By decoupling storage and running it as its own service, we can provide more resilience than if we were using native persistent disks tied to the stateful set or the nodes themselves. Rook and Ceph are providing the volumes directly, and can do so from any node in the cluster.
When you’ve finished, don’t forget to delete your cluster to avoid any ongoing charges.
That was a long tutorial! Congrats if you made it all the way through. I hope this has given you some ideas for how you can decouple storage in your own Kubernetes deployments (and maybe shown you some of the power of CRDs and Operators if you haven’t seen these before).
I’ll leave you with one last pinch of salt: After the talk I realised that Cassandra maybe wasn’t the best application to demo this with. While it does run as a stateful set, which was great for showing how Rook & Ceph overcome the normal restrictions of persistent stateful volumes, Cassandra itself is designed to replicate across multiple nodes and survive a single node failure. So in fact, in production, we may inadvertently introduce performance issues while trying to handle an outage. But I still think this is a worthy tutorial for the concepts it explores :-)
Thanks for reading and hopefully following along! Here are the slides from the talk, but they lose something without me sweating over a demo: