I always find the Olympics to be an unusual experience. I’m hardly an athletics fanatic, yet I can’t help but get swept up in the spirit of the competition. When the Olympics took place in Paris last summer, I suddenly began rooting for my country in sports I barely knew existed. I would spend random Thursday nights glued to the TV, cheering on athletes as they skillfully let fly arrows, discuses, hammers and so on.
It might have been the Olympic spirit coursing through me, or perhaps my deep fascination with data and mathematics, but at the time I found myself drawing parallels between Spark’s architecture and the world of sports. So here I am, ready to explain why Spark outranks its competitors and crosses the finish line first.
Let’s start with a general overview.
Spark and Hadoop: comparing strengths
Apache Spark is an open-source, distributed processing system from the Apache Software Foundation that allows large amounts of data to be processed efficiently. Before its introduction, other tools existed, but Spark solved several performance problems related to processing large datasets, effectively becoming the number one choice in the industry. Spark was born as a project that aimed to keep the benefits of earlier technology while making everything more efficient. In particular, Spark was created as a direct competitor to Hadoop, a Java-based, open-source big data platform, and aimed to preserve the scalable, distributed, fault-tolerant data processing provided by Hadoop’s MapReduce while boosting performance.
Let’s delve into the technicalities to understand a little more about Spark’s effectiveness.
Spark’s strength is distributed computing: when you execute your Spark program, Spark orchestrates your code across a distributed cluster of nodes. This is particularly useful when a significant amount of data has to be handled, and it is exactly what makes Spark a champion for operations on large datasets. A practical example would be IoT devices fitted in car models distributed all over the world. These devices collect hundreds of thousands of data points every day, from which useful information must be extracted, such as the average temperature of certain components. To obtain such statistics, it is necessary to perform operations such as grouping or filtering, perhaps to remove noise from the data, as the sketch below illustrates.
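To make that concrete, here is a minimal PySpark sketch of that kind of aggregation. The source path and the column names (temperature_c, vehicle_model, component) are hypothetical, and reading from S3 assumes the relevant connector is already configured:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-telemetry").getOrCreate()

# Hypothetical source path and column names, purely for illustration.
readings = spark.read.json("s3a://telemetry/readings/")

avg_temps = (
    readings
    .filter(F.col("temperature_c").between(-40, 150))  # drop obvious sensor noise
    .groupBy("vehicle_model", "component")
    .agg(F.avg("temperature_c").alias("avg_temperature_c"))
)

avg_temps.show()
```

Spark splits the readings into partitions and runs the filter and aggregation in parallel across the cluster, so the same few lines work whether the dataset holds thousands or billions of rows.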
Spark’s accessibility and adaptability are direct consequences of Spark Core, which serves as the foundation of the entire Apache Spark ecosystem, handling distributed task execution, scheduling, and basic input/output operations. In addition, Spark is convenient to use thanks to its APIs for several languages, including Java, Scala, Python and R.
Spark offers four main built-in libraries: Spark SQL, Spark Streaming, MLlib and GraphX. These libraries provide a large set of functionalities for different operations, such as data streaming, dataset handling, and machine learning:
- Spark SQL allows you to query structured data using SQL or the DataFrame API (a table-like structure); see the sketch after this list.
- Spark Streaming helps you process data in near real time – streams such as video feeds or sensor data – in a way that’s similar to how batch data is processed, and it ensures that data is handled correctly even if something goes wrong.
- MLlib is a library for machine learning, offering tools for tasks like classification, regression and clustering.
- GraphX is a library for working with graph data (like networks or relationships) and performing operations like analysis and transformation.
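As a taste of the first of these libraries, the sketch below runs the same kind of aggregation through both the DataFrame API and plain SQL. The event data and column names are invented for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A small in-memory DataFrame; schema and values are purely illustrative.
events = spark.createDataFrame(
    [("checkout", 120), ("search", 45), ("checkout", 200)],
    ["event_type", "duration_ms"],
)

# The same data can be queried through the DataFrame API...
events.groupBy("event_type").count().show()

# ...or through plain SQL, once the DataFrame is registered as a view.
events.createOrReplaceTempView("events")
spark.sql(
    "SELECT event_type, AVG(duration_ms) AS avg_duration_ms "
    "FROM events GROUP BY event_type"
).show()
```

Both forms compile to the same execution plan, so teams can mix SQL and DataFrame code freely.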
In summary, Apache Spark is designed to make the most of distributed resources, allowing large volumes of data to be processed and analyzed quickly and at scale, like runners all crossing the finish line at the same time.
Apache Spark’s architecture – it’s a team sport
Back to the Olympics metaphor, you might want to think about Apache Spark as if it were a team made up of (usually, but not exclusively) 200 players. Apache Spark’s architecture is made up of three main components: the driver, the executor, and the partitioner. The driver acts like a sports team coach, interpreting the code, creating a plan, and instructing the executors. Executors, in turn, execute these commands. Lastly, the partitioner is responsible for dividing data into smaller chunks (typically around 200 partitions), facilitating efficient, parallel computing.
From an architectural perspective, Spark utilises a manager/worker configuration. In this setup, a manager determines the number of worker nodes needed and how they should function. The worker nodes then execute tasks assigned by Spark, such as processing data streams, performing grouping, filtering, and applying business logic.
In more technical terms:
- The Spark driver is the process or program that oversees the execution of a Spark application. It runs the main function and initialises the SparkContext, which handles the connection to the cluster manager.
- The Spark executors are worker processes tasked with running operations in Spark applications. They are deployed on worker nodes and interact with both the driver program and the cluster manager. Executors handle tasks in parallel and manage data storage, either in memory or on disk, for caching and intermediate results.
- The cluster manager handles resource allocation and oversees the management of the cluster where the Spark application operates. Spark is compatible with multiple cluster managers, including Apache Mesos, Hadoop YARN, and its own standalone cluster manager.
- SparkContext serves as the gateway to all Spark functionalities. It represents the connection to a Spark cluster and is used to create RDDs (Resilient Distributed Datasets), accumulators, and broadcast variables. Additionally, SparkContext manages and coordinates the execution of tasks across the cluster (see the sketch after this list).
- A task is the most basic unit of work in Spark, representing a computation that operates on a single data partition. The driver program breaks down the Spark job into individual tasks and distributes them to executor nodes for processing.
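Here is a minimal sketch of how these pieces appear in application code, assuming local mode for simplicity (a real deployment would point the master at a cluster manager instead):

```python
from pyspark.sql import SparkSession

# The driver program starts here: building the SparkSession also creates the
# underlying SparkContext and connects to whichever cluster manager is configured.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")   # local mode with 4 worker threads; a cluster would use YARN, Kubernetes, etc.
    .getOrCreate()
)

sc = spark.sparkContext
print(sc.applicationId)        # identifier assigned by the cluster manager
print(sc.defaultParallelism)   # how many tasks Spark aims to run in parallel

# Each of the 8 partitions below becomes a task that an executor can process.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())
```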
Apache Spark features a clear, structured layer architecture built around two primary abstractions:
- Resilient Distributed Dataset (RDD): RDDs are the foundation of Spark. An RDD is an immutable collection of elements that can be processed across multiple nodes at the same time, using Spark’s parallel processing. Each RDD can be split into logical partitions, which are then processed on different nodes within a cluster. Ultimately, RDDs and the transformations applied to them make up the DAG.
- Directed Acyclic Graph (DAG): The DAG is an acyclic graph representing the workflow in Spark – an execution plan that organizes the operations that need to be executed. It is the scheduling layer in Apache Spark’s architecture that manages stage-based task scheduling. Unlike MapReduce, which organizes tasks into just two stages, Spark’s DAG can encompass multiple stages, allowing for more complex and flexible workflows.
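A small sketch of both abstractions at work, again assuming local mode: the transformations only extend the DAG, and the lineage that makes RDDs resilient can be printed before any computation happens.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: each call below only extends the DAG of RDDs.
numbers = sc.parallelize(range(100), numSlices=4)
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# The lineage (the plan Spark would replay to rebuild a lost partition)
# can be inspected before anything runs.
print(squares.toDebugString().decode())

# Only an action such as sum() turns the DAG into stages and tasks
# and ships them to the executors.
print(squares.sum())
```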
Spark vs Hadoop: use cases
Generally speaking, Spark’s advantage over Hadoop is speed, as Spark leverages in-memory distributed computing, a more efficient model than its contender’s. In fact, Spark can perform tasks up to 100 times faster than Hadoop, making it a great solution for low-latency processing use cases, such as machine learning. In the 100-meter sprint, Spark has already finished when Hadoop has barely left the starting block.
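A minimal sketch of what that in-memory advantage looks like in practice; the Parquet path and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Illustrative path; any columnar source would do.
trips = spark.read.parquet("s3a://datalake/trips/")

# Keep the working set in executor memory so repeated queries avoid
# re-reading from storage – this is where much of the speed-up over
# disk-based MapReduce comes from.
trips.cache()

trips.filter(F.col("distance_km") > 100).count()
trips.groupBy("city").agg(F.avg("fare").alias("avg_fare")).show()
```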
Going deeper, there are a few reasons why Spark is the de facto choice in the industry for managing big data pipelines:
- Fault tolerance: A key limitation of Hadoop is its dependence on disk I/O for fault tolerance. After the shuffle and sort phase, data is written to disk and later retrieved during the reduce phase, severely impacting performance. In contrast, Spark uses in-memory computation, utilising a DAG of RDDs so that if an RDD is lost or corrupted, it can be quickly recomputed by reapplying the necessary transformations to the previous RDDs.
- Programming language: Hadoop mainly supports Java, which can be difficult and developer-unfriendly for some. In contrast, Spark supports multiple languages, including Java, Scala, Python, and R. Scala, in particular, offers a more concise and streamlined syntax compared to Java, making it more convenient for developers (it’s also Spark’s default language).
- Storage: Hadoop requires a job’s input files to be stored in HDFS, and the job’s results are also written there. In contrast, Spark supports a wide variety of read/write sources.
- Iterative processes: For algorithms such as PageRank that involve iterative processing, Spark has a distinct advantage due to its in-memory computation. In contrast, Hadoop faces performance challenges because it writes results to disk after each iteration and then retrieves them for the next iteration, resulting in significantly slower performance compared to Spark.
- Interactive Mode: Spark provides an interactive shell with low latency, enabling more dynamic and responsive data analysis, whereas Hadoop does not support interactive modes due to its higher latency. Additionally, in contrast to Hadoop, Spark can optimise its execution plan by evaluating multiple transformations together (see the sketch after this list). Finally, Spark enables you to use OIDC instead of Kerberos, greatly simplifying authentication.
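To illustrate that execution-plan point, the sketch below chains several transformations and then prints the physical plan: Spark’s optimiser collapses the separate filters and the projection into a single scan. The column names and thresholds are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-demo").master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# Two separate filters and a projection written step by step...
pipeline = (
    df
    .filter(F.col("id") > 100)
    .select("id", "bucket")
    .filter(F.col("bucket") == 3)
)

# ...are evaluated together: the physical plan shows one scan with a
# single combined predicate rather than a chain of intermediate jobs.
pipeline.explain()
```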
Why is Apache Spark on Kubernetes the winning solution?
Not even the best athletes can reach their full potential alone; Spark performs best when it’s supported by the right technology. Deploying Spark applications on Kubernetes offers numerous advantages over other cluster resource managers, such as Apache YARN. Here are some of the main benefits:
- Simplified deployment and management: Using tools such as MicroK8s or managed cloud services, cluster management becomes much easier and more cost-effective.
- Simplified authentication: Kubernetes makes authentication easier than YARN and Kerberos, while maintaining high security standards.
- Unification of workloads: It allows Spark to be run together with other applications on a single platform, making it more efficient.
- Separation of data and computation: Kubernetes makes it possible to separate data from computation, following best practices. Thanks to modern networks, it is possible to maintain high performance even on a large scale.
However, using Apache’s native tools for deploying Spark on Kubernetes can be complicated. Various parameters need to be configured and various requirements need to be met in order for everything to work properly. This is where Charmed Apache Spark comes in, greatly simplifying the process by offering automation and a more intuitive management interface, effectively reducing the difficulties associated with configuring and deploying Spark on Kubernetes.
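To give a sense of what that native configuration involves, here is a rough client-mode sketch using standard Spark-on-Kubernetes settings; the API server address, namespace, service account, container image and resource figures are all placeholders that depend entirely on your cluster.

```python
from pyspark.sql import SparkSession

# Every value below is a placeholder: the API server address, namespace,
# service account and container image depend on your own cluster setup.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.5")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```

Getting images, service accounts, RBAC and networking right for each of these settings is exactly the kind of work Charmed Spark automates away.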
Unlike other Kubernetes operators for Apache Spark, which can complicate the user experience by requiring direct interactions with the cluster via Custom Resource Definitions (CRD) or knowledge of Helm, Charmed Spark stands out for its simplicity. Based on Juju, a unified management platform, Charmed Spark simplifies the use of advanced technologies on both cloud IaaS and Kubernetes-compatible clusters.
Charmed Spark is officially supported when deployed on MicroK8s, Canonical Charmed Kubernetes and AWS Elastic Kubernetes Service (EKS). In addition, Charmed Spark offers maintenance of the Canonical ROCKs OCI container image and the ability to subscribe to a 24/7 technical support service. Other solutions for deploying Spark on Kubernetes may not guarantee regular image maintenance or offer paid support options.
Some common use cases for big data champions
If your use case has any of the following requirements, Spark is likely to be the best choice:
- Data size and hardware capacity: When data volumes are so large that the processing capacity and memory of a single machine are not sufficient.
- Complexity of operations: When operations involve large joins, complex aggregations or large-scale calculations that require distributed processing.
- Scalability: When it is necessary to scale horizontally in order to process large volumes of data efficiently on a cluster of machines.
- Performance requirements: When it is necessary to improve performance and reduce processing time for large data sets.
- Machine learning: Data scientists can enrich their expertise in data analysis and machine learning by using Spark, optionally accelerated with GPUs. In fact, the ability to process large volumes of data quickly, leveraging an already familiar language, can significantly accelerate the innovation process (a minimal MLlib sketch follows this list).
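As that minimal MLlib sketch, the example below assembles two invented feature columns and fits a logistic regression on a toy dataset; in practice the DataFrame would be a large distributed table rather than a handful of rows.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Tiny illustrative dataset; column names and values are invented.
data = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.8, 0.3), (1.0, 2.9, 1.8)],
    ["label", "feature_a", "feature_b"],
)

# MLlib models expect a single vector column of features.
features = VectorAssembler(
    inputCols=["feature_a", "feature_b"], outputCol="features"
).transform(data)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
```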
Apache Spark and Hadoop are not always competing solutions; they can also be used together, depending on business needs. For example, Hadoop is ideal for batch processing, handling large volumes of data in parallel across multiple nodes, and is suitable for tasks that are not time-sensitive. Many organizations decide to combine Hadoop and Spark to benefit from both Hadoop’s strengths and the in-memory processing speed of Spark.
Apache Spark takes gold
In conclusion, Apache Spark is a powerful and flexible tool designed for processing large volumes of data efficiently using distributed computing. Spark is widely used for real-time data streaming, machine learning, and graph processing, and it can be accessed through multiple programming languages such as Java, Scala, Python, and R. By leveraging in-memory computation and a fault-tolerant system, Spark outperforms older tools like Hadoop, making it the go-to choice for big data analysis. Canonical’s Charmed Apache Spark on Kubernetes simplifies the deployment and management process, offering greater flexibility, performance, and ease of use, ensuring quick, reliable, and scalable data processing. With Spark on Kubernetes, organizations can achieve their data goals faster, with greater precision and scalability, securing their place on the podium of big data analytics.
- Starting a new project? Contact us
- Build your online data hub with our whitepaper