Categories: BlogCanonicalUbuntu

How to run Apache Spark on MicroK8s and Ubuntu Core, in the cloud: Part 4

In this series, we’ve been building up an Apache Spark cluster on Kubernetes using MicroK8s, Ubuntu Core OS, LXD

Sponsored
and GCP. We’ve learned about and set up nested virtualisation on the cloud, and had some fun. But right, it’s retrospective time: in Part 1, we saw how to get MicroK8s up on LXD, on Ubuntu Core using Multipass. In Part 2, we looked at getting the same setup running under nested virtualisation on the cloud. And in Part 3, we bunged an Apache Spark cluster on top and interrogated it from a Jupyter notebook. All in a day’s work!

But we’re not finished quite yet. The setup we’ve built so far is vertically scalable. But we want to be able to scale horizontally and add more Ubuntu Core instances to a cluster in order to handle big data workloads with Spark. So in this final Part 4 of the series, we’ll get to that using LXD clustering and fan networking.

Fan networking builds a virtual, overlay software-defined network on top of the real, underlay network, so that VMs and Linux Containers can communicate with one another across hosts seamlessly. LXD’s fan networking implementation can handle up to a theoretical ~16 million hosts, with 65k VMs or Linux Containers per host. So for our Spark cluster, it should be good!

The first step is to deploy three Ubuntu Core VMs on GCE. We’ll take a smaller instance size this time, so this project shouldn’t end up costing you more than a regular oat milk latte from your favourite coffee shop. You can terminate and delete the VM we used the last time if you want, and we will start afresh. While we’re here, we’ll also adjust the MTU for the VPC so that everything works as expected.

Use the following commands:

gcloud compute networks update default --mtu=1500

for x in {0..2};
do
gcloud beta compute instances create ubuntu-core-20-$x --zone=europe-west1-b --machine-type=n2-standard-2 --network-interface network=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --min-cpu-platform="Intel Cascade Lake" --image=ubuntu-core-20-secureboot --boot-disk-size=60GB --boot-disk-type=pd-balanced --boot-disk-device-name=ubuntu-core-20-$x --shielded-secure-boot --no-shielded-vtpm --no-shielded-integrity-monitoring --reservation-affinity=any --enable-nested-virtualization --create-disk=size=60,mode=rw,auto-delete=yes,name=storage-disk-$x,device-name=storage-disk-$x
done

Cool! We got the three systems up and deployed! Since we preinstalled LXD in the VM disk image back in Part 2, we can just log into the first one and initialize the LXD cluster and the fan network from it.

VLAN Party: building the LXD cluster and fan network

Time to cluster up kids, here we go! Run the following commands to initialize the cluster:

GCE_IIP0=$(gcloud compute instances list | grep ubuntu-core-20-0 | awk '{ print $5 }')
INT0_IP=$(gcloud compute instances list | grep ubuntu-core-20-0 | awk '{ print $4 }')
SUBNET=$(echo $INT0_IP | sed 's/.[0-9]*$/.0/')

cat > preseed-0.yaml <@$GCE_IIP0:.
ssh @$GCE_IIP0 "cat preseed-0.yaml | sudo lxd init --preseed"

Oh wow, that’s the first node bootstrapped. On to building the rest of the LXD cluster now. We just need the cert data from the bootstrap server in order to get going. We’ll build preseed files and apply them to the other two nodes:

Sponsored
GCE_IIP1=$(gcloud compute instances list | grep ubuntu-core-20-1 | awk '{ print $5 }')
GCE_IIP2=$(gcloud compute instances list | grep ubuntu-core-20-2 | awk '{ print $5 }')

INT1_IP=$(gcloud compute instances list | grep ubuntu-core-20-1 | awk '{ print $4 }')
INT2_IP=$(gcloud compute instances list | grep ubuntu-core-20-2 | awk '{ print $4 }')

CERT_DATA=$(ssh @$GCE_IIP0 "sudo sed 's/$/n/g' /writable/system-data/var/snap/lxd/common/lxd/cluster.crt")

cat > preseed-1.yaml < preseed-2.yaml <@$GCE_IIP1:.
ssh -t @$GCE_IIP1 "cat preseed-1.yaml | sudo lxd init --preseed"

scp preseed-2.yaml @$GCE_IIP2:.
ssh -t @$GCE_IIP2 "cat preseed-2.yaml | sudo lxd init --preseed"

Alive! she cried: deploying MicroK8s

Your LXD cluster should be alive now. Alive! Now we can pour MicroK8s over the top, but this time all clustered and highly available. Since LXD has made our Ubuntu Core instances operate as one, we can launch several MicroK8s instances from one host, and they’ll be distributed across the LXD cluster as needed.

One gotcha to be aware of is that our LXD fan network, like our MicroK8s Calico network, is making use of VXLAN networking, meaning that we will need to adjust the MTUs at each layer. We already changed the MTU for the VPC to 1500 in a previous step. In the next steps, we’ll change the MTU for the VMs running MicroK8s to 1450, and the MTU for the Calico network to 1400, as well as deploying and clustering MicroK8s:

for x in {0..2};
do
ssh -t @$GCE0_IIP "sudo lxc init ubuntu:20.04 microk8s-$x --vm -c limits.memory=6GB -c limits.cpu=2"
ssh -t @$GCE0_IIP "sudo lxc config device override microk8s-$x root size=40GB"
ssh -t @$GCE0_IIP "sudo lxc start microk8s-$x"
sleep 90 # let it boot
ssh -t @$GCE0_IIP "sudo lxc exec microk8s-$x -- echo '            mtu: 1450' | sudo tee -a /etc/netplan/50-cloud-init.yaml"
ssh -t @$GCE0_IIP "sudo lxc exec microk8s-$x -- sudo netplan apply"
ssh -t @$GCE0_IIP "sudo exec microk8s-$x -- sudo snap install microk8s --classic"
ssh -t @$GCE0_IIP sudo exec microk8s-$x -- microk8s.kubectl patch configmap/calico-config -n kube-system --type merge -p '{"data":{"veth_mtu": "1400"}}'
ssh -t @$GCE0_IIP sudo exec microk8s-$x -- microk8s.kubectl rollout restart daemonset calico-node -n kube-system
done

for x in {1..2};
do
JOIN_CMD=$(ssh -t @$GCE_IIP0 "sudo lxc exec microk8s-0 -- microk8s add-node" | grep "microk8s join" | tail -n 1)

ssh -t @$GCE_IIP0 "sudo lxc exec microk8s-$x -- $JOIN_CMD"
done

Just a highly available Kubernetes cluster, on a highly available LXD cluster, on a bunch of next-gen Ubuntu Core OS boxes using nested virtualisation! To get Spark deployed and running on the top, we can replay the steps we did in Part 3 of this series.

Caesar wrap: review

To recap:

  • In Part 1, we used Multipass to spin up an Ubuntu Core VM on our workstation, and then we deployed MicroK8s on it using LXD and a Linux Container.
  • In Part 2, we took that work to the GCP cloud, and we made it even better by using a nested VM to run MicroK8s inside our Ubuntu Core instance, again using LXD.
  • In Part 3, we put Apache Spark on top and queried it from a Jupyter Notebook server.
  • And finally, in this Part 4, we made the solution fully distributed and highly available using LXD clustering and fan networking.

Thanks for being such great company on this journey – but now it’s your turn to take this further. I want you to have some fun, and don’t forget to tag us on Twitter (@ubuntu) with some examples of your nicest projects!

Contact us to learn more about MicroK8s, LXD, Ubuntu Core or any of the technologies that we’ve explored in this series. We’ll gladly set up a call so that you can find out more about how our technologies and services can help you to jump-start your next implementation.

Ubuntu Server Admin

Recent Posts

Building RAG with enterprise open source AI infrastructure

One of the most critical gaps in traditional Large Language Models (LLMs) is that they…

14 hours ago

Life at Canonical: Victoria Antipova’s perspective as a new joiner in Product Marketing

Canonical is continuously hiring new talent. Being a remote- first company, Canonical’s new joiners receive…

2 days ago

What is patching automation?

What is patching automation? With increasing numbers of vulnerabilities, there is a growing risk of…

3 days ago

A beginner’s tutorial for your first Machine Learning project using Charmed Kubeflow

Wouldn’t it be wonderful to wake up one day with a desire to explore AI…

4 days ago

Ubuntu brings comprehensive support to Azure Cobalt 100 VMs

Ubuntu and Ubuntu Pro supports Microsoft’s Azure Cobalt 100 Virtual Machines (VMs), powered by their…

4 days ago

Ubuntu Weekly Newsletter Issue 870

Welcome to the Ubuntu Weekly Newsletter, Issue 870 for the week of December 8 –…

4 days ago