The use of AI is a hot topic for any organisation right now. The allure of operational insights, profit, and cost reduction that could be derived from existing data makes it a technology that’s being rolled out at an incredible pace in even change-resistant organisations.
However, AI systems that deliver these insights, savings, and profits, rely heavily on access to large amounts of data. Without performant, and reliable storage systems, even the most cutting-edge AI solution will not be able to provide timely results. Additionally, these new AI related workloads cannot impact existing business applications, both need to operate together harmoniously.
In this blog, we will explore some of the requirements placed on a storage system by an AI solution, as well as the types of data used. We will introduce Ceph as one of the options available to store both AI-related data and typical business data.
New AI applications mean additional pressures and requirements for your storage systems. Here’s what your storage system needs in order to support new AI workloads:
AI workloads need access to lots of data, and quickly: first, when reading raw data, and second, for writing the output following processing. It is not uncommon to see requirements in the region of 100’s GBps and over 1TBps!
Storage solutions such as Ceph allow for caching elements to be added to assist with bursty write workloads, and the ability to scale-out to increase overall system throughput.
The AI infrastructure of today is not what the AI infrastructure of tomorrow will look like. A storage system needs to be able to adapt to the needs of AI workloads, both in its ability to scale up for capacity and throughout reasons, but also to scale down should the hardware need to be reused elsewhere in an organisation’s infrastructure.
Following on from scalability, a storage system needs to be flexible enough to accommodate different types of AI workloads. Not all data is created equal; some can be more important than others, and over time it’s value can change as well. For example, bank transaction data is more likely to be accessed during the first thirty to sixty days when people are checking their balances and viewing end of month statements, than say in 3 years time. However, it is still important that the data is preserved and available should it need to be accessed at that time.
Therefore your storage system needs to be capable of offering different tiers of storage to meet this requirement. A storage system such as Ceph allows a user to combine heterogeneous hardware, allowing them to mix and match over time as system needs dictate.
The most important role that a storage system plays is storing data. There is little use of a storage system that is highly performant, but which cannot store data reliably; what good is generating and processing data if it can’t be retrieved?A solution like Ceph allows a user to choose from replication and erasure coding based protection strategies – again, to allow for a system configuration that can match business value with cost to store.
Now that we understand the characteristics that a high-quality storage system needs to provide, let’s take a look at what kinds of data are most typical in AI applications. There isn’t just “one” type of AI data. There are multiple different types, all used at various stages of developing, training and deploying AI models.
This is the source data extracted and retrieved from all manner of applications and systems: chat tools, email archives, CCTV recordings, support call recordings or autonomous vehicle telemetry – just to name a few examples. This data can be in all forms: database tables, text, images, audio, or video.
Once extracted from these systems this data is typically pre-processed to ensure that it is in a useful format for training. Pre-processing can also remove otherwise redundant steps later in the pipeline, saving time and computation resources. With certain data sets pre-processing is used to anonymise the data to ensure regulatory compliance is met.
Training datasets are typically a subset of pre-processed data that is used to train an AI model. What makes this dataset special is that the expected model output has already been defined. It is important that these datasets are preserved so that they can be used to refine a model or evaluate its performance.
The structure of an AI model – the layers and nodes – needs to be reliably stored so that the model can be redeployed in the future. Additionally an AI model contains parameters and weights that are tweaked during model training. Future adjustments can be made to these variables to fine tune the model or to deploy it in an inference role.
This is the most important of all the steps of importing, pre-processing, training, and deployment. The output, or inference data, is typically the most useful and valuable data for business uses, and it must be stored so that it is available for use. In some cases, this data needs to be retained for auditing and future refinement.
Finding a storage solution that delivers everything you’re looking for – cost, speed, flexibility, scalability, and support for a multitude of data sets and types – is difficult. Proprietary storage solutions can be inflexible, and public cloud services soon become costly as you grow; two areas where in-house open source solutions would be an ideal answer.
Canonical Ceph is a storage solution for all scales and all workloads, from the edge to large scale AI modelling, and for all storage protocols. Mixed workloads, with different performance, capacity, and access needs can all be accommodated by a single cluster. The scale out nature of Ceph means that hardware can be added incrementally to meet either performance or capacity needs.
Block storage needs are provided via the RADOS Block Device (RBD) protocol – a highly scalable multipath-native block transport. To support legacy environments iSCSI can also be accommodated via gateway, and in a future release highly available NVMeoF will be supported as well.
Shared File storage is presented either via Ceph’s native POSIX compatible protocol CephFS, or via the NFS protocol again via a gateway.
An Object storage API with compatibility for both the S3 and Swift APIs is fully supported in a Ceph cluster.
Join our webinar on using Ceph for AI workloads here and learn about:
One of the most critical gaps in traditional Large Language Models (LLMs) is that they…
Canonical is continuously hiring new talent. Being a remote- first company, Canonical’s new joiners receive…
What is patching automation? With increasing numbers of vulnerabilities, there is a growing risk of…
Wouldn’t it be wonderful to wake up one day with a desire to explore AI…
Ubuntu and Ubuntu Pro supports Microsoft’s Azure Cobalt 100 Virtual Machines (VMs), powered by their…
Welcome to the Ubuntu Weekly Newsletter, Issue 870 for the week of December 8 –…