Kubernetes and Spark, shining together
In my previous post, I discussed how to write a simple Spark application in Kotlin, and run it with Airflow. This time around, let's see how we can run that same application on Kubernetes instead.
As with writing Spark jobs in Kotlin, a lot of the nitty-gritty work has already been done. This time it comes in the form of an Operator developed by the Google Cloud Platform team (though it's in beta and not officially supported). There is an alternative project that seems to do much the same thing, but it doesn't appear to be as actively maintained, so we'll stick with the GCP version for now. First of all, though: what actually is a Spark Operator?
The core idea of an Operator is to provide an extension to Kubernetes to manage and monitor workloads the same way a human operator (hence the name) might. That sounds ambitious, but there are a lot of these projects being actively developed and the concept seems to have really caught on in recent years. Check out OperatorHub for an overview.
Operators work by defining a custom resource, and then telling Kubernetes (through its API) how to handle that resource. In our case, the custom resource is SparkApplication, but it could also be things like a database cluster, Akka, or, if you really want to go meta, Kubeflow.
Circling back to the GCP Spark Operator, it boasts features such as setting up the cluster for you according to a YAML specification, improved integration between Hadoop and Kubernetes, and metrics collection for Prometheus. Also, it's a Google project, so it's perhaps unsurprising that it also integrates with Google Cloud Storage to handle application dependencies.
So, that's the theory. Let's take the same application we developed last time and see whether we can use this Operator to run it.
Working with the Operator
The key to using the Operator is to define a SparkApplication (the custom resource) in YAML. Before we can do that, however, we need to install the Operator. This is done through the Helm package manager for Kubernetes:
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install spark-operator incubator/sparkoperator --set enableWebhook=true
That was easy enough, so now we can get to our application definition. The repository offers a number of simple examples to get us started, like this one. We'll need to do some more work on it, however. First off, we need to make sure the application has access to both the application jar and the PostgreSQL JDBC Driver jar, since the latter is used in the application. We do this by placing them in a Kubernetes Volume and mounting it to the driver and executors.
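The relevant part of the spec looks roughly like this (a sketch; the volume name, paths, and jar locations are placeholders for illustration):

```yaml
# Fragment of a SparkApplication spec: a hostPath volume holding the jars,
# mounted into both the driver and executor pods.
spec:
  volumes:
    - name: jars-volume
      hostPath:
        path: /host/path/to/jars   # directory on the node containing the jars
  driver:
    volumeMounts:
      - name: jars-volume
        mountPath: /mnt/jars
  executor:
    volumeMounts:
      - name: jars-volume
        mountPath: /mnt/jars
```

Note that mounting volumes into the driver and executor pods is exactly why we installed the Operator with `enableWebhook=true` above: the mutating admission webhook is what applies these mounts to the pods.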
Note: There seem to be some issues at the moment with mounting volumes when using Minikube on Docker Desktop, as discussed here and here. I was able to use Docker Desktop's 'internal' Kubernetes to get it to work, so you might want to give that a try if Minikube is giving you problems.
Next, the official container image for the Operator (gcr.io/spark-operator/spark) comes with Java 8 installed. Our application is compiled with Java 11, however, so we'll need to address that. The solution to that particular problem is pretty simple, fortunately. We just need to customise the image a little:
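A minimal Dockerfile along these lines does the trick (assuming the base image is Debian-based; adjust the package manager and paths if your image differs):

```dockerfile
# Start from the official Spark Operator image, which ships with Java 8.
FROM gcr.io/spark-operator/spark:v2.4.5

USER root

# Install Java 11 alongside the bundled Java 8.
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-11-jdk && \
    rm -rf /var/lib/apt/lists/*

# Point the container at the new JDK so Spark picks it up.
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${JAVA_HOME}/bin:${PATH}"
```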
This will install Java 11 and configure it to be used. Finally, putting it all together:
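The full definition looks something like the following. This is a sketch against the `v1beta2` API: the application name, main class, jar file names, and volume paths are illustrative placeholders, so substitute your own.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: books-app                     # name assumed for illustration
spec:
  type: Scala
  mode: cluster
  image: books-img                    # the custom Java 11 image from above
  imagePullPolicy: Never              # use the local image, don't pull
  mainClass: BooksAppKt               # placeholder main class
  mainApplicationFile: "local:///mnt/jars/books-app.jar"
  sparkVersion: "2.4.5"
  deps:
    jars:
      - local:///mnt/jars/postgresql-jdbc.jar
  sparkConf:
    # Make Spark load the JDBC driver on both driver and executors.
    spark.driver.extraClassPath: /mnt/jars/postgresql-jdbc.jar
    spark.executor.extraClassPath: /mnt/jars/postgresql-jdbc.jar
  driver:
    cores: 1
    memory: 512m
    volumeMounts:
      - name: jars-volume
        mountPath: /mnt/jars
  executor:
    instances: 1
    memory: 512m
    volumeMounts:
      - name: jars-volume
        mountPath: /mnt/jars
  volumes:
    - name: jars-volume
      hostPath:
        path: /host/path/to/jars     # local directory holding the jars
```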
There is quite a lot going on here, so let's break it down. We begin by specifying that this is to be a SparkApplication and give some of the key parameters, such as the main class and the jar it can be found in. We also specify the custom image from before (I named it "books-img" because I am the very model of creativity) and tell Kubernetes not to try to pull it from any online repo and just use the local image instead (the "imagePullPolicy" line).
Next up, we define the driver and executor nodes. In this case, the only customisation is to mount the volume containing the jars and to set memory limits, but there are quite a few options to tinker with the pods these components will eventually run in.
We then list the JDBC Driver as a dependency and set configuration options for Spark to load it. Finally, we need to actually define the volume we're mounting. It's very simple in this case, just a local directory.
Ok, now that we have the Operator installed and the application defined, we can run the application. Fortunately, all the hard work has already been done. Actually starting the application could not be easier:
kubectl apply -f <yaml file>
It's that simple, really. The results are the same as last time around, as might be expected. The process, however, is very different, and by deploying the application on Kubernetes we can leverage the giant ecosystem that has been built up around it to make further development and deployment easier and more scalable.
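Since the Operator tracks the application as a custom resource, we can also inspect it with the usual tooling (the names below assume the application was called "books-app"; the Operator names the driver pod after the application):

```shell
# List all Spark applications the Operator knows about
kubectl get sparkapplications

# Show detailed status and events for ours
kubectl describe sparkapplication books-app

# The driver runs in a regular pod, so its logs are available as usual
kubectl logs books-app-driver
```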