Kubernetes and Rook

Just before DevFest last year, I spoke at our local GDG meetup about the perils of highly available persistent storage in Kubernetes. In the talk we summarised the usual approaches in GKE, including zonal and regional persistent disks, and how they behave with stateful and stateless deployments.

The bold elevator pitch of the talk was that storage should be run as its own service, to be consumed by your deployments, rather than built as part of them using native components. The thing is, talks these days need to make a bold statement, even if you don’t believe in it 100% :-)

Storage services

There are a few different storage services available, but we quickly dismissed them:

  • Cloud Storage - perfect for object storage, but not particularly performant and will usually require application refactoring (devs take note: don’t always expect access to a POSIX filesystem!)
  • Cloud Filestore - an expensive but reliable and performant managed NFS solution. However, sadly it is still a zonal resource (please fix this Google!)
  • Roll-your-own in Compute - the last resort of a desperate sysadmin. Get lost for days in buggy Puppet modules trying to string a GlusterFS cluster together. Then just pray you never need to change it.

So is there a better solution?

Ceph

Ceph is an open-source project that provides massively scalable, software-defined storage systems on commodity hardware. It can provide object, block or file system storage, and automatically distributes and replicates data across multiple storage nodes to guarantee no single point of failure. It’s used by CERN to store petabytes of data generated by the Large Hadron Collider, so it’s probably good enough for us!

But what has this got to do with Kubernetes? Do we have to learn to build and configure yet another system?

Rook

Introducing Rook, an operator and orchestrator for Kubernetes that automates the provisioning, configuration, scaling, migration and disaster recovery of storage. Rook supports several backend providers and uses a consistent common framework across all of them. The Ceph provider for Rook is stable and production ready.

In a nutshell: Ceph is a massive resilient storage service, and Rook automates it for Kubernetes.

Now to the point of this post: let’s run Rook & Ceph on Kubernetes and see for ourselves how awesome it is! To follow along you’ll need a GCP project. We’ll create some billable resources for the duration of the tutorial, but you can delete them all as soon as we’re done. Make sure you have gcloud set up locally (which should include kubectl), or to save time you can just use Cloud Shell.
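
If your gcloud installation doesn’t already include kubectl, you can usually add it with:

gcloud components install kubectl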

Create a GKE Cluster

Let’s start by creating a regional GKE cluster with the gcloud command. Note that we’ll run Ubuntu on our nodes, not COS, to give us a bit more flexibility. I’m creating my cluster in us-central1 but feel free to choose another region.

gcloud container clusters create "demo-cluster" --region "us-central1" \
--no-enable-basic-auth --machine-type "n1-standard-2" --num-nodes "1" \
--image-type "UBUNTU" --disk-size "100" --disk-type "pd-standard" \
--no-issue-client-certificate --enable-autoupgrade --enable-ip-alias \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--cluster-version "1.13.11-gke.15" --enable-autorepair \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append"

It will take a few minutes to create your cluster, after which you can set up your local kubectl with:

gcloud container clusters get-credentials demo-cluster --region=us-central1

Deploy the Rook Operator

Next we’ll deploy Rook’s Custom Resource Definitions (CRDs) and Operator to our cluster. Thankfully, all the manifests required for this are provided in Rook’s git repo. We can find what we need in the cluster/examples/kubernetes/ceph directory:

git clone https://github.com/rook/rook
cd rook/cluster/examples/kubernetes/ceph

Before we actually deploy these manifests, we need to make a quick change to the operator.yaml file. Open it up and locate the following lines (somewhere around line 79):

# - name: FLEXVOLUME_DIR_PATH
#   value: "<PathToFlexVolumes>"

Change them to:

- name: FLEXVOLUME_DIR_PATH
  value: "/home/kubernetes/flexvolume"

(Keep the indentation consistent: the - should sit in the same column that the # occupied.)

This tells Rook where the kubelet expects FlexVolume drivers to live, as GKE nodes use a non-default path. Now we can create the CRDs and Operator:

kubectl create -f common.yaml
kubectl create -f operator.yaml

It will take a while for all the pods involved in Rook’s operator to be happy. You can keep an eye on things with:

watch kubectl -n rook-ceph get pods

When you have a rook-ceph-operator pod and 3 rook-discover pods, you’re ready to move on and you can CTRL-C out of watching the output. At this point you might want to move out of the Rook git repo directory.

Create the Ceph cluster and storage class

Next we’ll create the Ceph cluster itself: the set of Ceph daemons (monitors, managers and OSDs) that actually provide the storage service. Create the following ceph-cluster.yaml:
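
Here’s a minimal sketch to get you started, loosely based on the example manifests shipped in the Rook repo. Field names move around between Rook releases, so cross-check against the examples in the version you cloned, and treat the Ceph image tag as a placeholder:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # Example Ceph Nautilus image; use whatever tag the examples in your Rook checkout reference
    image: ceph/ceph:v14.2.4-20190917
  # Where Ceph stores its data on each node's local disk
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  dashboard:
    enabled: true
  storage:
    useAllNodes: true
    useAllDevices: false
    # Directory-based OSDs: Ceph keeps its OSD data in a block file under this path on every node
    directories:
      - path: /var/lib/rook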

This cluster manifest defines things like the version of Ceph we’re using and where Ceph can store its data. We specify a dataDirHostPath to tell it to use the local disk of the Kubernetes node it runs on, and on each node it will create a roughly 10GiB block file for the OSD’s data.

Next we create ceph-blockpool.yaml:
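
A minimal pool definition looks something like this (three replicas, spread across hosts):

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  # Replicate each piece of data on three different hosts
  failureDomain: host
  replicated:
    size: 3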

A CephBlockPool is how we create a replicated block storage service. As I mentioned before, block storage is just one type of storage that Ceph can provide, but this is what we need if we want to create virtual disks (or persistent volumes).

Finally we create a new storage class for Kubernetes that uses our block pool, ceph-storageclass.yaml:
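
Something along these lines should work, assuming the FlexVolume-based provisioner we configured earlier and the replicapool block pool from the previous step:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: ceph.rook.io/block
parameters:
  # The block pool defined in ceph-blockpool.yaml
  blockPool: replicapool
  # The namespace the Rook cluster runs in
  clusterNamespace: rook-ceph
  # Format new volumes with xfs
  fstype: xfs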

Note that the storage class references the block pool we just created. The provisioner will also helpfully create an xfs filesystem on any volumes it provisions. Now we’ve written these files, let’s create the objects, starting with the cluster:

kubectl create -f ceph-cluster.yaml

The Rook operator does a lot of work at this point to configure your Ceph cluster. You can use watch kubectl -n rook-ceph get pods again to keep an eye on things. Wait about 3 or 4 minutes until everything is running happily, with no further pods initialising or creating. Then you can create the remaining objects:

kubectl create -f ceph-blockpool.yaml
kubectl create -f ceph-storageclass.yaml

Once everything is up and running, we can consume the storage class to dynamically provision and claim persistent volumes backed by Ceph.
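
For example, a standalone claim against the new class looks something like this (the name test-claim is just for illustration, and we won’t actually create it here; the Cassandra stateful set below requests its volumes the same way via a volumeClaimTemplate):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi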

Deploy a test application: Cassandra

Let’s deploy Cassandra, so we actually have a useful application we can use to test our Ceph & Rook storage service. Apache Cassandra is an open-source, distributed, wide-column store NoSQL database. Like Ceph, it’s also designed to replicate across multiple commodity servers. To set it up for Kubernetes, first we create cassandra-service.yaml:
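
A headless service like this one (adapted from the Cassandra example in the Kubernetes docs) is all that’s needed so the Cassandra pods can discover each other:

apiVersion: v1
kind: Service
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  # Headless service: no load-balanced IP, just per-pod DNS records
  clusterIP: None
  ports:
    - port: 9042
      name: cql
  selector:
    app: cassandra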

Then we’ll create a StatefulSet for Cassandra. Stateful sets in Kubernetes are different to stateless deployments, in that they expect their associated volumes to persist. If something happens to a pod and it has to be recreated, it expects all of the previous data to still exist. So this is a great test for Ceph. Create the following cassandra-statefulset.yaml:
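
Here’s a version adapted from the Cassandra example in the Kubernetes docs, with the volumeClaimTemplate pointed at our rook-ceph-block storage class; treat the image tag and resource limits as placeholders you can tune:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      # Give Cassandra plenty of time to drain on shutdown
      terminationGracePeriodSeconds: 1800
      containers:
        - name: cassandra
          image: gcr.io/google-samples/cassandra:v13
          ports:
            - containerPort: 7000
              name: intra-node
            - containerPort: 7001
              name: tls-intra-node
            - containerPort: 7199
              name: jmx
            - containerPort: 9042
              name: cql
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "500m"
              memory: 1Gi
          env:
            - name: MAX_HEAP_SIZE
              value: 512M
            - name: HEAP_NEWSIZE
              value: 100M
            # The first pod acts as the seed for the rest of the ring
            - name: CASSANDRA_SEEDS
              value: "cassandra-0.cassandra.default.svc.cluster.local"
            - name: CASSANDRA_CLUSTER_NAME
              value: "K8Demo"
            - name: CASSANDRA_DC
              value: "DC1-K8Demo"
            - name: CASSANDRA_RACK
              value: "Rack1-K8Demo"
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          readinessProbe:
            exec:
              command:
                - /bin/bash
                - -c
                - /ready-probe.sh
            initialDelaySeconds: 15
            timeoutSeconds: 5
          volumeMounts:
            - name: cassandra-data
              mountPath: /cassandra_data
  # Each pod gets its own 1GiB volume, dynamically provisioned by Rook/Ceph
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 1Gi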

This is a pretty big manifest, but don’t worry too much about the details! We set up a bunch of ports and environment variables necessary for Cassandra to work, as well as a readinessProbe that should allow it to gracefully handle any disruption to the cluster. Note the volumeClaimTemplate at the end, where we ask for a 1GiB volume using the rook-ceph-block storage class we created earlier.

Now create the Cassandra application:

kubectl create -f cassandra-service.yaml
kubectl create -f cassandra-statefulset.yaml

You can watch the Cassandra pods come to life (we’re using the wide option here so we can take note of our GKE node names):

watch kubectl get pods -o wide

After a while, your Cassandra stateful set should contain 3 happy pods:

NAME          READY   STATUS    RESTARTS   AGE     IP           NODE   
cassandra-0   1/1     Running   0          4m26s   10.28.1.17   gke-demo-cluster-default-pool-3983d5e5-7q2w
cassandra-1   1/1     Running   0          3m34s   10.28.2.13   gke-demo-cluster-default-pool-9c04b3ed-0jrr
cassandra-2   1/1     Running   0          116s    10.28.0.10   gke-demo-cluster-default-pool-f202df4d-ck5d

Make a note of the name of the node running the cassandra-1 pod.

Simulate a zone failure

To prove how useful Rook & Ceph are, we’ll simulate a zone failure. Kubernetes will be forced to re-create a pod on a node in a different zone, but as the pod is part of a stateful set it will expect to retain access to its data.

Before we do that, let’s check Cassandra is happy and healthy. We can do this by running the nodetool command on one of our pods:

kubectl exec -it cassandra-0 -- nodetool status

You should see some output that shows Cassandra is happy with the 3 nodes in its cluster:

Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.28.2.13  103.81 KiB  32           60.5%             41d5e80c-f2af-42bb-8bb2-f05d86819cf2  Rack1-K8Demo
UN  10.28.0.10  89.9 KiB   32           72.0%             0ce6b369-212b-4dd5-a711-6a2fd2259a3b  Rack1-K8Demo
UN  10.28.1.17  104.55 KiB  32           67.4%             8a690630-407d-4b18-9342-741349155e59  Rack1-K8Demo

Next, we’ll cordon the node running the cassandra-1 pod (replace with your node name from the output you got previously). This prevents Kubernetes from scheduling any more work on this node.

kubectl cordon gke-demo-cluster-default-pool-3983d5e5-7q2w

Now we’ll delete the cassandra-1 pod. Before you go ahead and type the next command, think about what we’re doing:

  • We’re deleting a pod in a StatefulSet. The StatefulSet controller will recreate the missing pod with the same identity. In other words, it will create a new cassandra-1 (not a cassandra-3)
  • As it’s a StatefulSet, it assumes there will be a stateful volume it can attach to, where its data should already be persisted
  • On a single-zone cluster this normally wouldn’t be a problem, but each node of our regional cluster is in a different zone, and volumes backed by persistent disks are zonal resources
  • It can’t recreate cassandra-1 on the same node because we cordoned it!

Run this command and watch what magically happens:

kubectl delete pod cassandra-1
watch kubectl get pods -o wide

Kubernetes has to schedule the new cassandra-1 pod on a different node, in a different zone (as confirmed by the wide output), but by some sorcery there’s no volume affinity conflict and the pod comes up with all of its data!


You can confirm Cassandra is happy again with:

kubectl exec -it cassandra-0 -- nodetool status

By decoupling storage and running it as its own service, we get more resilience than we would from native persistent disks tied to the stateful set or to the nodes themselves. Ceph replicates the data across nodes (and therefore across zones), so Rook can serve the volume to a pod on any node in the cluster.

When you’ve finished, don’t forget to delete your cluster to avoid any ongoing charges.
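
To tear everything down in one go:

gcloud container clusters delete demo-cluster --region us-central1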

Debrief

That was a long tutorial! Congrats if you made it all the way through. I hope this has given you some ideas for how you can decouple storage in your own Kubernetes deployments (and maybe shown you some of the power of CRDs and Operators if you haven’t seen these before).

I’ll leave you with one last pinch of salt: after the talk I realised that Cassandra maybe wasn’t the best application to demo this with. While it does run as a stateful set, which was great for showing how Rook & Ceph overcome the normal restrictions of persistent stateful volumes, Cassandra itself is designed to replicate across multiple nodes and survive a single node failure. Layering it on top of Ceph means the data gets replicated twice (once by Cassandra, once by Ceph), so in production we might inadvertently introduce performance issues while trying to handle an outage the database could already survive. But I still think this is a worthy tutorial for the concepts it explores :-)

Thanks for reading and hopefully following along! Here are the slides from the talk, but they lose something without me sweating over a demo: