If you’ve followed the steps in Part 1 and Part 2 of this series, you’ll have a working MicroK8s deployment on the next-gen Ubuntu Core.
First, let’s download the Apache Spark release binaries and adapt the Dockerfile that ships with the release so that it plays nicely with PySpark:
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar xzf spark-3.1.2-bin-hadoop3.2.tgz
cd spark-3.1.2-bin-hadoop3.2
cat > Dockerfile << 'EOF'
# Adapted from kubernetes/dockerfiles/spark/Dockerfile in the Spark release,
# with python3 and pip added so that PySpark installs cleanly
ARG java_image_tag=11-jre-slim
FROM openjdk:${java_image_tag}
ARG spark_uid=185
RUN set -ex && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps python3 python3-pip && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*
COPY jars /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY kubernetes/dockerfiles/spark/decom.sh /opt/
COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data
RUN pip3 install pyspark
RUN pip3 install findspark
ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh
ENTRYPOINT [ "/opt/entrypoint.sh" ]
# Specify the User that the actual main process will run as
USER ${spark_uid}
EOF
So Apache Spark runs in OCI containers on Kubernetes. Yes, that’s a thing now. Let’s build that container image so that we can launch it on our Ubuntu Core hosted MicroK8s in the sky:
sudo apt install docker.io
sudo docker build . --no-cache -t localhost:32000/spark-on-uk8s-on-core-20:1.0
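If you’d like to sanity-check the build before pushing it anywhere, you can list the freshly tagged image (this just filters docker images by the repository name we used above):
sudo docker images localhost:32000/spark-on-uk8s-on-core-20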
We will use an SSH tunnel to push the image to our remote private registry on MicroK8s. Yep, it’s time to open another terminal and run the following commands (substituting your own SSH username for <user>) so we can set up a tunnel to help us to do that:
GCE_IIP=$(gcloud compute instances list | grep ubuntu-core-20 | awk '{ print $5}')
UK8S_IP=$(ssh <user>@$GCE_IIP sudo lxc list microk8s | grep microk8s | awk -F'|' '{ print $4 }' | awk -F' ' '{ print $1 }')
ssh -L 32000:$UK8S_IP:32000 <user>@$GCE_IIP
Make sure you leave that terminal open so that the tunnel stays up, and switch back to the one you were using before. The next step is to push the Apache Spark on Kubernetes container image we previously built to the private image registry we installed on MicroK8s, all running on our Ubuntu Core instance on Google Cloud:
sudo docker push localhost:32000/spark-on-uk8s-on-core-20:1.0
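To double-check that the push made it through the tunnel, you can ask the registry which tags it holds for our repository, using the standard Docker Registry v2 API; the response should include "1.0":
curl http://localhost:32000/v2/spark-on-uk8s-on-core-20/tags/list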
Next, we’ll set up Jupyter so that we can launch and interact with Spark. Hop back into a terminal that has an SSH session open to the Ubuntu Core instance on GCE, and run the following command to launch a Jupyter notebook server on k8s. (Now would be a good time to stretch your legs, because it’ll take a few minutes to complete.)
sudo lxc exec microk8s -- sudo microk8s.kubectl run --port 6060 \
--port 37371 --port 8888 --image=ubuntu:20.04 jupyter -- \
bash -c "apt update && DEBIAN_FRONTEND=noninteractive apt install python3-pip wget openjdk-11-jre-headless -y && pip3 install jupyter && pip3 install pyspark && pip3 install findspark && wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz && tar xzf spark-3.1.2-bin-hadoop3.2.tgz && jupyter notebook --allow-root --ip '0.0.0.0' --port 8888 --NotebookApp.token='' --NotebookApp.password=''"
We need the CA certificate of our MicroK8s Kubernetes API server to be available to Jupyter so that it will trust the encrypted connection to the API. So we’ll need to copy the CA cert over to our Jupyter container. Again, in the terminal with the SSH session open to the Ubuntu Core instance on GCE, run the following command:
sudo lxc exec microk8s -- sudo microk8s.kubectl cp /var/snap/microk8s/current/certs/ca.crt jupyter:.
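If you want to confirm the copy worked: the ubuntu:20.04 image’s working directory is /, so the certificate should land at /ca.crt, which is the path our Spark configuration will reference later:
sudo lxc exec microk8s -- sudo microk8s.kubectl exec jupyter -- ls -l /ca.crt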
Next, we’ll expose Jupyter with a pair of Kubernetes services. We will then be able to set up another tunnel and reach it from our workstation’s browser, and our Spark executor instances will be able to call back to the Spark driver and block manager. Use the following commands:
sudo lxc exec microk8s -- sudo microk8s.kubectl expose pod jupyter --type=NodePort --name=jupyter-ext --port=8888
sudo lxc exec microk8s -- sudo microk8s.kubectl expose pod jupyter --type=ClusterIP --name=jupyter --port=37371,6060
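A quick look at the services confirms that both exist, and shows the NodePort that Kubernetes allocated for jupyter-ext; the tunnel command in the next step fishes that port out automatically:
sudo lxc exec microk8s -- sudo microk8s.kubectl get services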
Oh yes oh yes, it’s time for another SSH tunnel through to the LXC container, so that we can securely reach the Jupyter notebook server from our workstation’s browser over an encrypted link. Remember to leave the terminal open so that the tunnel stays up.
J_PORT=$(ssh <user>@$GCE_IIP sudo lxc exec microk8s -- sudo microk8s.kubectl get all | grep jupyter-ext | awk '{ print $5 }' | awk -F':' '{ print $2 }' | awk -F'/' '{ print $1 }')
ssh -L 8888:$UK8S_IP:$J_PORT <user>@$GCE_IIP
Now you should be able to browse to Jupyter using your workstation’s browser, nice and straightforward. Just point it at http://localhost:8888, the local end of the tunnel we just opened.
In the newly opened Jupyter tab of your browser, create and launch a new IPython notebook, and add the following Python script. If all goes well, you’ll be able to launch a Spark cluster, connect to it, and execute a parallel calculation when you run the stanza.
import os
import findspark
# Point findspark at the Spark distribution we unpacked inside the pod
os.environ["SPARK_HOME"] = "/spark-3.1.2-bin-hadoop3.2"
findspark.init()
from pyspark import SparkContext, SparkConf

# Talk to the Kubernetes API from inside the cluster
conf = SparkConf().setAppName('spark-on-uk8s-on-core-20').setMaster('k8s://https://kubernetes.default.svc')
# Executor pods run the image we pushed to the MicroK8s registry
conf.set("spark.kubernetes.container.image", "localhost:32000/spark-on-uk8s-on-core-20:1.0")
conf.set("spark.kubernetes.allocation.batch.size", "50")
# Encrypt and authenticate traffic between the driver and executors
conf.set("spark.io.encryption.enabled", "true")
conf.set("spark.authenticate", "true")
conf.set("spark.network.crypto.enabled", "true")
conf.set("spark.executor.instances", "5")
# Trust the MicroK8s CA certificate we copied into the pod
conf.set("spark.kubernetes.authenticate.driver.caCertFile", "/ca.crt")
# Executors call back to the driver via the jupyter ClusterIP service
conf.set("spark.driver.host", "jupyter")
conf.set("spark.driver.port", "37371")
conf.set("spark.blockManager.port", "6060")

sc = SparkContext(conf=conf)
print(sc)

# A simple parallel calculation to prove the cluster works
big_list = range(100000000000000)
rdd = sc.parallelize(big_list, 5)
odds = rdd.filter(lambda x: x % 2 != 0)
odds.take(20)
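While that cell runs, it’s fun to watch the executors come up. Spark on Kubernetes labels the pods it launches with spark-role, so from an SSH session on the Ubuntu Core instance you can run:
sudo lxc exec microk8s -- sudo microk8s.kubectl get pods -l spark-role=executor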
Ok, we didn’t do anything very advanced with our Spark cluster in the end. But in Part 4, we’ll take this still further. The next step is going to be all about banding together a bunch of these Ubuntu Core VM instances using LXD clustering and a virtual overlay fan network. With this approach, you’ll be able to build your own highly available, fully distributed, MicroK8s-powered Kubernetes cluster. Turtles all the way down! It’d be cool to have that up on the cloud as a “hardcore” (pardon my punning), zero-trust-security hardened, alternative way to run all the things.
See you again in Part 4!