Just before DevFest last year, I spoke at our local GDG meetup about the perils of persistent high availability storage in Kubernetes. In the talk we summarised the usual approaches in GKE including zonal and regional persistent disks, and how they operate with stateful and stateless deployments.
The bold elevator pitch of the talk was that storage should be run as its own service, to be consumed by your deployments, rather than built as part of them using native components. The thing is, talks these days need to make a bold statement, even if you don’t believe in it 100% :-)
Storage services
There are a few different storage services available, but we quickly dismissed them:
Cloud Storage - perfect for object storage, but not particularly performant and will usually require application refactoring (devs take note: don’t always expect access to a POSIX filesystem!)
Cloud Filestore - an expensive but reliable and performant managed NFS solution. However, sadly it is still a zonal resource (please fix this Google!)
Roll-your-own in Compute - the last resort of a desperate sysadmin. Get lost for days in buggy Puppet modules trying to string a GlusterFS cluster together. Then just pray you never need to change it.
So is there a better solution?
Ceph
Ceph is an open-source project that provides massively scalable, software-defined storage systems on commodity hardware. It can provide object, block or file system storage, and automatically distributes and replicates data across multiple storage nodes to guarantee no single point of failure. It’s used by CERN to store petabytes of data generated by the Large Hadron Collider, so it’s probably good enough for us!
But what has this got to do with Kubernetes? Do we have to learn to build and configure yet another system?
Rook
Introducing Rook, an operator and orchestrator for Kubernetes that automates the provisioning, configuration, scaling, migration and disaster recovery of storage. Rook supports several backend providers and uses a consistent common framework across all of them. The Ceph provider for Rook is stable and production ready.
In a nutshell: Ceph is a massive resilient storage service, and Rook automates it for Kubernetes.
Now to the point of this post, let’s run Rook & Ceph on Kubernetes and see for ourselves how awesome it is! To follow along you’ll need a GCP project. We’ll create some billable resources for the duration of the tutorial, but you can delete them all as soon as we’re done. Make sure you have gcloud set up locally (instructions here), which should include kubectl, or to save time you can just use Cloud Shell.
Create a GKE Cluster
Let’s start by creating a regional GKE cluster with the gcloud command. Note that we’ll run Ubuntu on our nodes, not COS, to give us a bit more flexibility. I’m creating my cluster in us-central1 but feel free to choose another region.
gcloud container clusters create "demo-cluster" \
  --region "us-central1" \
  --no-enable-basic-auth \
  --machine-type "n1-standard-2" \
  --num-nodes "1" \
  --image-type "UBUNTU" \
  --disk-size "100" \
  --disk-type "pd-standard" \
  --no-issue-client-certificate \
  --enable-autoupgrade \
  --enable-ip-alias \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --cluster-version "1.13.11-gke.15" \
  --enable-autorepair \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append"
It will take a few minutes to create your cluster, after which you can set up your local kubectl with:
gcloud container clusters get-credentials demo-cluster --region=us-central1
Deploy the Rook Operator
Next we’ll deploy Rook’s Custom Resource Definitions (CRDs) and Operator to our cluster. Thankfully, all the manifests required for this are provided in Rook’s git repo. We can find what we need in the cluster/examples/kubernetes/ceph directory:
git clone https://github.com/rook/rook
cd rook/cluster/examples/kubernetes/ceph
Before we actually deploy these manifests, we need to make a quick change to the operator.yaml file. Open it up and locate the following lines (somewhere around line 79):
# - name: FLEXVOLUME_DIR_PATH
#   value: ""
Change them to:
- name: FLEXVOLUME_DIR_PATH
  value: "/home/kubernetes/flexvolume"
(The - should be vertically aligned to the previous # above it)
This tells Rook where the kubelet on GKE nodes looks for FlexVolume plugins, which differs from the Kubernetes default. Now we can create the CRDs and Operator:
kubectl create -f common.yaml
kubectl create -f operator.yaml
It will take a while for all the pods involved in Rook’s operator to be happy. You can keep an eye on things with:
watch kubectl -n rook-ceph get pods
When you have a rook-ceph-operator pod and 3 rook-discover pods, you’re ready to move on and you can CTRL-C out of watching the output. At this point you might want to move out of the Rook git repo directory.
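If you’d rather not sit watching the output, a small wrapper around kubectl wait can block until the operator pod reports Ready. This is only a sketch: the function name is one I made up, and it assumes the operator pod carries the label app=rook-ceph-operator (check with kubectl -n rook-ceph get pods --show-labels if the wait never matches anything).

```shell
# Hypothetical helper: block until the Rook operator pod is Ready.
# Assumption: the pod is labelled app=rook-ceph-operator.
wait_for_rook_operator() {
  timeout="${1:-300s}"   # optional first argument, defaults to 5 minutes
  kubectl -n rook-ceph wait pod \
    --selector app=rook-ceph-operator \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Calling wait_for_rook_operator 120s returns as soon as the pod is Ready, or fails once the two minutes are up.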
Create the Ceph cluster and storage class
Next we’ll create the Ceph cluster. This is basically our cluster of Ceph agents that provide the storage service. Create the following ceph-cluster.yaml:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v14.2
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  dashboard:
    enabled: true
  storage:
    storageClassDeviceSets:
    - name: set1
      count: 3
      portable: false
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 10Gi
          volumeMode: Block
          accessModes:
            - ReadWriteOnce
This cluster manifest defines things like the version of Ceph we’re using, and where Ceph can store its data. The dataDirHostPath tells Rook where on each Kubernetes node’s local disk to keep Ceph’s configuration and state, and the storageClassDeviceSets section requests three 10Gi block devices, one per node in our three-node cluster.
Next we create ceph-blockpool.yaml:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
A CephBlockPool is how we create a replicated block storage service. As I mentioned before, block storage is just one type of storage that Ceph can provide, but this is what we need if we want to create virtual disks (or persistent volumes).
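It’s worth doing the arithmetic on what that replication costs. Here’s a rough sketch, assuming the three 10Gi devices from the cluster manifest and a replica size of 3. It ignores Ceph’s own metadata overhead and headroom, so treat the answer as an upper bound rather than a promise. The function name is just for illustration.

```shell
# Rough usable capacity of a replicated pool: raw capacity / replica count.
# Ignores Ceph overhead, so the real figure will be somewhat lower.
usable_gib() {
  devices="$1"    # number of devices (count in storageClassDeviceSets)
  size_gib="$2"   # size of each device in GiB
  replicas="$3"   # replicated.size in the CephBlockPool
  echo $(( devices * size_gib / replicas ))
}

usable_gib 3 10 3   # prints 10
```

In other words, 30GiB of raw disk leaves us at most about 10GiB to hand out as persistent volumes, because every block is stored three times.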
Finally we create a new storage class for Kubernetes that uses our block pool, ceph-storageclass.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: xfs
reclaimPolicy: Delete
Note that the storage class references the block pool we just created. The provisioner will also helpfully create an xfs filesystem for us on any disks it creates. Now that we’ve written these files, let’s create the objects, starting with the cluster:
kubectl create -f ceph-cluster.yaml
The Rook operator does a lot of work at this point to configure your Ceph cluster. You can use watch kubectl -n rook-ceph get pods again to keep an eye on things. Wait about 3 or 4 minutes until everything is running happily, with no further pods initialising or creating. Then you can create the remaining objects:
kubectl create -f ceph-blockpool.yaml
kubectl create -f ceph-storageclass.yaml
Once everything is up and running, we can consume the storage class to dynamically provision and claim persistent volumes backed by Ceph.
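Before deploying a real application, a throwaway claim is a quick way to check that dynamic provisioning actually works. The sketch below only prints a PersistentVolumeClaim manifest. The function name and the claim name test-claim are placeholders of mine, not anything Rook provides.

```shell
# Print a minimal PVC manifest that requests storage from rook-ceph-block.
# Hypothetical helper; pipe its output into kubectl to create the claim:
#   pvc_manifest test-claim 1Gi | kubectl apply -f -
#   kubectl get pvc test-claim      # STATUS should become Bound
#   kubectl delete pvc test-claim   # tidy up afterwards
pvc_manifest() {
  name="$1"
  size="$2"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: ${size}
EOF
}
```

If the claim sits in Pending for more than a minute or so, the Ceph cluster probably isn’t healthy yet, and the pods in the rook-ceph namespace are the place to look.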
Deploy a test application: Cassandra
Let’s deploy Cassandra, so we actually have a useful application we can use to test our Ceph & Rook storage service. Apache Cassandra is an open-source, distributed, wide-column store NoSQL database. Like Ceph, it’s also designed to replicate across multiple commodity servers. To set it up for Kubernetes, first we create cassandra-service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  clusterIP: None
  ports:
  - port: 9042
  selector:
    app: cassandra
Then we’ll create a StatefulSet for Cassandra. Stateful sets in Kubernetes are different to stateless deployments, in that they expect their associated volumes to persist. If something happens to a pod and it has to be recreated, it expects all of the previous data to still exist. So this is a great test for Ceph. Create the following cassandra-statefulset.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - nodetool drain
        env:
        - name: MAX_HEAP_SIZE
          value: 512M
        - name: HEAP_NEWSIZE
          value: 100M
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.default.svc.cluster.local"
        - name: CASSANDRA_CLUSTER_NAME
          value: "K8Demo"
        - name: CASSANDRA_DC
          value: "DC1-K8Demo"
        - name: CASSANDRA_RACK
          value: "Rack1-K8Demo"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: rook-ceph-block
      resources:
        requests:
          storage: 1Gi
This is a pretty big manifest, but don’t worry too much about the details! We set up a bunch of ports and environment variables necessary for Cassandra to work, and configure a readinessProbe that should allow it to gracefully handle any disruption to the cluster. Note the volumeClaimTemplates section at the end, where we ask for a 1Gi volume using the rook-ceph-block storage class we created earlier.
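Kubernetes names each claim created from a volumeClaimTemplates entry as template name, StatefulSet name and pod ordinal, joined with hyphens. That stable name is what lets a recreated pod find its old volume again. A small sketch that lists the names you should expect to see (the helper name is hypothetical):

```shell
# List the PVC names a StatefulSet creates from one volumeClaimTemplate.
# Kubernetes uses the pattern <template>-<statefulset>-<ordinal>.
expected_pvcs() {
  template="$1"
  statefulset="$2"
  replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${statefulset}-${i}"
    i=$(( i + 1 ))
  done
}

expected_pvcs cassandra-data cassandra 3
# cassandra-data-cassandra-0
# cassandra-data-cassandra-1
# cassandra-data-cassandra-2
```

Once the stateful set exists, kubectl get pvc should show these three claims, each bound to a volume from the rook-ceph-block storage class.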
Now create the Cassandra application:
kubectl create -f cassandra-service.yaml
kubectl create -f cassandra-statefulset.yaml
You can watch the Cassandra pods come to life (we’re using the wide option here so we can take note of our GKE node names):
watch kubectl get pods -o wide
After a while, your Cassandra stateful set should contain 3 happy pods:
NAME          READY   STATUS    RESTARTS   AGE     IP           NODE
cassandra-0   1/1     Running   0          4m26s   10.28.1.17   gke-demo-cluster-default-pool-3983d5e5-7q2w
cassandra-1   1/1     Running   0          3m34s   10.28.2.13   gke-demo-cluster-default-pool-9c04b3ed-0jrr
cassandra-2   1/1     Running   0          116s    10.28.0.10   gke-demo-cluster-default-pool-f202df4d-ck5d
Make a note of the name of the node running the cassandra-1 pod.
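If you’d rather not copy the node name by eye, it can be pulled out of the wide output with a bit of awk. This is a sketch that assumes the default column order, where NODE is the seventh field. The function name is my own.

```shell
# Print the NODE column for one pod, reading `kubectl get pods -o wide`
# output on stdin. Assumes NODE is the seventh whitespace-separated field.
node_of_pod() {
  awk -v pod="$1" '$1 == pod { print $7 }'
}

# Usage (needs a running cluster):
#   kubectl get pods -o wide | node_of_pod cassandra-1
```

Asking the API directly with kubectl get pod cassandra-1 -o jsonpath='{.spec.nodeName}' is sturdier, since it doesn’t depend on how the table happens to be laid out.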
Simulate a zone failure
To prove how useful Rook & Ceph are, we’ll simulate a zone failure. GKE will be forced to re-create a pod, but as it is part of a stateful set it will expect to still retain access to its data.
Before we do that, let’s check Cassandra is happy and healthy. We can do this by running the nodetool command on one of our pods:
kubectl exec -it cassandra-0 -- nodetool status
You should see some output that shows Cassandra is happy with the 3 nodes in its cluster:
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.28.2.13 103.81 KiB 32 60.5% 41d5e80c-f2af-42bb-8bb2-f05d86819cf2 Rack1-K8Demo
UN 10.28.0.10 89.9 KiB 32 72.0% 0ce6b369-212b-4dd5-a711-6a2fd2259a3b Rack1-K8Demo
UN 10.28.1.17 104.55 KiB 32 67.4% 8a690630-407d-4b18-9342-741349155e59 Rack1-K8Demo
Next, we’ll cordon the node running the cassandra-1 pod (replace the node name below with the one you noted from your own output). This prevents Kubernetes from scheduling any more work on this node.
kubectl cordon gke-demo-cluster-default-pool-9c04b3ed-0jrr
Now we’ll delete the cassandra-1 pod. Before you go ahead and type the next command, think about what we’re doing:
We’re deleting a pod in a StatefulSet. Kubernetes will try to recreate the missing pod in the same part of the set. In other words, it will try to create a new cassandra-1 (not cassandra-3).
As it’s a StatefulSet, it assumes there will be a stateful volume it can attach to, where its data should already be persisted.
On a single node cluster this normally wouldn’t be a problem, but each node of our cluster is in a different zone, and volumes backed by persistent disks are zonal resources.
It can’t recreate cassandra-1 on the same node because we cordoned it!
Run this command and watch what magically happens:
kubectl delete pod cassandra-1
watch kubectl get pods -o wide
Kubernetes has to schedule the new cassandra-1 pod on a different node (as confirmed by the wide output), but by some sorcery we don’t have an affinity conflict, and the pod comes up no problem!
You can confirm Cassandra is happy again with:
kubectl exec -it cassandra-0 -- nodetool status
By decoupling storage and running it as its own service, we can provide more resilience than if we were using native persistent disks tied to the stateful set or the nodes themselves. Rook and Ceph are providing the volumes directly, and can do so from any node in the cluster.
When you’ve finished, don’t forget to delete your cluster to avoid any ongoing charges.
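For completeness, here’s a sketch of the clean-up using the same cluster name and region as earlier in this post. The wrapper function is just my own convenience, and gcloud will ask you to confirm before it deletes anything.

```shell
# Delete the tutorial cluster. Defaults match the names used in this post;
# pass your own cluster name and region if you changed them.
delete_demo_cluster() {
  cluster="${1:-demo-cluster}"
  region="${2:-us-central1}"
  gcloud container clusters delete "$cluster" --region "$region"
}
```

I’d also run gcloud compute disks list afterwards, as disks that were provisioned dynamically for persistent volumes may outlive the cluster and keep costing money.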
Debrief
That was a long tutorial! Congrats if you made it all the way through. I hope this has given you some ideas for how you can decouple storage in your own Kubernetes deployments (and maybe shown you some of the power of CRDs and Operators if you haven’t seen these before).
I’ll leave you with one last pinch of salt: After the talk I realised that Cassandra maybe wasn’t the best application to demo this with. While it does run as a stateful set, which was great for showing how Rook & Ceph overcome the normal restrictions of persistent stateful volumes, Cassandra itself is designed to replicate across multiple nodes and survive a single node failure. So in fact, in production, we may inadvertently introduce performance issues while trying to handle an outage. But I still think this is a worthy tutorial for the concepts it explores :-)
Thanks for reading and hopefully following along! Here are the slides from the talk, but they lose something without me sweating over a demo: