Data is at the heart of all machine learning (ML) initiatives – and bad actors know it. As AI continues to occupy the limelight of modern tech discourse, ML systems are becoming increasingly attractive targets for attack. With the Identity Theft Resource Center reporting a 72% spike in data breaches in 2023, it’s critical to take the proper precautions to ensure your ML projects don’t provide a back door to your data.
This blog gives an overview of machine learning security risks, highlighting the key threats and challenges. But it isn’t all doom and gloom; we’ll also explain best practices and explore possible solutions, including the role of open source.
Every technology is subject to security concerns, but the challenge is even greater with machine learning because of the shortage of specialised security talent and the novelty of AI applications entering production.
With so many moving parts and potential avenues of attack, machine learning projects are subject to a large number of security risks, and this number is only growing as more and more ML applications enter production. Here are the four most common threats you should be aware of.
Depending on the regulations of each industry and organisation, companies must ensure that the software they use does not contain any critical or high vulnerabilities. However, ML projects often depend on thousands of packages, so it’s easy for vulnerabilities to slip through the cracks. These vulnerabilities can appear at all layers of the stack, from the operating system to the applications, and they can become major security risks when exploited maliciously. An infamous example in the AI space is ShellTorch, a set of critical vulnerabilities in the TorchServe model-serving tool that exposed code used during development and allowed attackers to access the models.
To mitigate this risk, you should have a clear understanding of the packages that your ML projects use, as well as their dependencies. You should implement regular scans that report vulnerabilities and have a strategy for fixing them. This includes having regular updates and upgrades of the tools used, following the latest news and security updates and having a trusted advisor.
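A first step towards that clear understanding is an automated inventory of what is actually installed. The snippet below is a minimal sketch using only the Python standard library; the resulting list is what you would feed to a dedicated scanner such as pip-audit (the workflow here is illustrative, not a prescribed toolchain):

```python
# Minimal sketch: build an inventory of installed packages and versions,
# the starting point for any vulnerability scan. In practice you would
# feed this list to a scanner such as pip-audit or an SBOM generator.
from importlib import metadata

def dependency_inventory() -> dict[str, str]:
    """Return a mapping of installed package names to their versions."""
    return {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip malformed metadata entries
    }

if __name__ == "__main__":
    for name, version in sorted(dependency_inventory().items()):
        print(f"{name}=={version}")
```

Running this regularly (for example in CI) gives you a reproducible snapshot of your dependency surface, so that new or changed packages never go unnoticed.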
Data poisoning happens when attackers tamper with the data used to train a model, altering outcomes to damage the system’s performance. Malicious data introduced into the training pipeline causes the model to learn something that is both inaccurate and unintended. For example, if your training dataset includes surveillance camera footage, attackers might stage the scene so that only red cars appear for a certain period of time, deliberately skewing what the model learns. Even a small amount of incorrectly labelled data can affect the model’s performance, making it unreliable for production.
With a clear understanding of how data can be influenced by possible attackers, you can implement measures to mitigate these risks. A continuous re-training pipeline ensures models always stay up to date, while drift monitoring for both the model and the data ensures that professionals are informed in a timely manner when a model’s accuracy degrades or the data distribution shifts.
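The drift-monitoring idea can be sketched in a few lines. The toy below flags a batch whose mean has moved several standard deviations away from the training baseline – one symptom a poisoned data stream can show. A real pipeline would use a dedicated monitoring tool and richer statistics; the numbers here are invented for illustration:

```python
# Minimal drift-monitoring sketch (illustrative, not a production tool):
# flag a feature as drifting when the mean of a new batch moves more than
# `threshold` baseline standard deviations away from the training mean.
from statistics import mean, stdev

def detect_drift(baseline: list[float], new_batch: list[float],
                 threshold: float = 3.0) -> bool:
    """True when new_batch's mean deviates from the baseline mean by
    more than `threshold` baseline standard deviations."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        return mean(new_batch) != base_mean
    return abs(mean(new_batch) - base_mean) / base_std > threshold

# Example: a clean batch versus one shifted far from the training range
baseline = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1]
clean_batch = [10.0, 10.3, 9.7]
poisoned_batch = [25.0, 26.1, 24.8]
print(detect_drift(baseline, clean_batch))     # expect False
print(detect_drift(baseline, poisoned_batch))  # expect True
```

Wiring a check like this into the ingestion step means suspicious batches can be quarantined before they ever reach a training set.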
Adversarial attacks are the most commonly used type of attack in the machine learning space. They involve feeding the model inputs crafted by the attacker to produce a specific, incorrect output. These attacks exploit the fact that the decision boundary a model learns – the surface separating different classes based on input features – rarely matches the true boundary exactly, so small, targeted perturbations can push an input across it. Such perturbations are often hard to detect, both for the human eye and for monitoring systems.
Adversarial attacks reduce model accuracy and can discourage professionals from running certain projects in production at all. Organisations should consider adversarial training and have a clear strategy for building ML projects and cleaning data. Not all data that is produced should land directly in a training set. In addition, not everyone should have access to all the models created within an organisation, or to capabilities such as experiment tracking, model stores and model performance trackers.
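To make the decision-boundary intuition concrete, here is a minimal FGSM-style sketch against a hand-written logistic classifier (the weights and epsilon are made up for illustration). Because the gradient of the loss with respect to the input points along the weight vector, a small nudge of each feature in that direction is enough to flip the prediction:

```python
# FGSM-style sketch on a hand-written logistic classifier (illustrative,
# no ML framework needed). For a score w.x + b, the loss gradient with
# respect to the input points along w, so shifting x by eps * sign(w)
# is the cheapest way to push it across the decision boundary.
import math

def predict(w: list[float], b: float, x: list[float]) -> float:
    """Probability of the positive class under logistic regression."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(w: list[float], x: list[float], eps: float) -> list[float]:
    """Shift x against the model: subtract eps along sign(w), which
    lowers w.x and hence the predicted probability."""
    return [xi - eps * math.copysign(1.0, wi) for wi, xi in zip(w, x)]

w, b = [2.0, -1.0], 0.0
x = [1.0, 0.5]                       # clean input: w.x + b = 1.5 -> class 1
x_adv = fgsm_perturb(w, x, eps=0.9)  # perturbed input crosses the boundary

print(predict(w, b, x))      # well above 0.5
print(predict(w, b, x_adv))  # pushed below 0.5 by a small perturbation
```

Adversarial training counters exactly this: perturbed examples like `x_adv` are added back into the training set with their correct labels, so the learned boundary becomes harder to cross with small nudges.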
ML algorithms are built to predict or generate new data by looking only at existing information. Unlike individuals, companies have access to the personal data of millions of people. Whenever an ML system is given access to data, the data’s confidentiality is at risk because of the new workflows involved.
Organisations should build a clear privacy policy that all users read and agree to, and create ML systems that protect everyone. They should also be mindful of the data they gather and the processes involved in handling it. Best practices, such as removing identifiers and maintaining clear visibility of data flows, protect the privacy of both organisations and the people who interact with them.
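Removing identifiers can be as simple as replacing them with a keyed hash before data enters the ML pipeline. The sketch below uses HMAC-SHA256 to produce stable, non-reversible pseudonyms; the key value and the field names are illustrative assumptions, and a real deployment should treat this as one layer of a broader anonymisation strategy:

```python
# Minimal pseudonymisation sketch: replace direct identifiers with a
# keyed hash before data enters the ML pipeline. The secret key shown
# here is a placeholder (illustrative); in practice it would live in a
# secrets manager. A key prevents simple rainbow-table reversal.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical key

def pseudonymise(identifier: str) -> str:
    """Return a stable, non-reversible token for a direct identifier."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

def scrub_record(record: dict) -> dict:
    """Replace identifier fields with pseudonyms, keep the rest intact."""
    sensitive = {"email", "name", "phone"}  # illustrative field list
    return {
        k: pseudonymise(v) if k in sensitive else v
        for k, v in record.items()
    }

record = {"email": "alice@example.com", "age": 34, "purchases": 12}
print(scrub_record(record))
```

Because the pseudonym is stable, records belonging to the same person can still be joined for training, while the raw identifier never reaches the model or the people working on it.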
The four threats outlined above are just some of the machine learning security risks that projects face. They stand alongside the traditional software threats that are always present in any technology. As such, protecting your ML systems requires a specialised approach that considers the unique risks present in the AI space while also drawing on broader security best practices.
Open source is at the heart of machine learning development. The Linux Foundation’s AI & Data landscape shows the abundance of tools that data scientists, ML engineers and architects now have available to experiment with and to run their ML projects in production. It includes open source tools used to develop, deploy, monitor and store models. Many of them focus on security, which we will explore further.
Canonical has open source in its DNA. The company’s promise is to provide secure open source software across all layers of the stack, from the operating system to the cloud-native applications. We will further explore how our security tools and capabilities enhance ML systems.
Livepatch is a solution that periodically checks for kernel patches and applies them without rebooting the system. It enables organisations to keep their machines up to date with the latest kernel patches, reducing downtime and unplanned work. This is especially useful for long-running training jobs, which can continue uninterrupted without risking the outcome or causing project delays. Additionally, it enables organisations to build and follow their own update strategy by planning the patching time and rollout policy. Livepatch is available on Ubuntu as part of Ubuntu Pro.
Confidential computing has roots going back to the late 1970s, but the rise of AI has accelerated its adoption. Using innovative technology at the silicon level, it protects the confidentiality and integrity of sensitive data hosted on-prem or on a public cloud. Highly regulated industries such as healthcare and financial services often adopt it. Ubuntu is at the heart of confidential computing, already available in confidential VMs on Microsoft Azure and on Intel TDX-enabled hardware. Learn more about confidential computing.
New common vulnerabilities and exposures (CVEs) emerge daily and need timely patching. Ubuntu Pro helps teams address this need by covering security fixes for over 30,000 packages as part of the subscription. This includes machine learning tools such as Pandas, Python, NumPy, TensorFlow and PyTorch, enabling professionals to develop models securely. Read more about how to secure your MLOps tooling.
Machine learning operations (MLOps) platforms such as Charmed Kubeflow enable organisations to run AI at scale. They ensure that ML systems have features such as authentication and network isolation to better control and protect data and ML models. They are a foundational piece for running the entire ML lifecycle within one tool, reducing the number of security gaps that could appear throughout the ML pipeline.
Snaps are a secure and scalable way to embed applications on Linux devices. They can also be used for ML models that are packaged and deployed to edge devices. Snaps simplify maintenance and enable models to benefit from over-the-air (OTA) updates and automatic rollback in case of failure. Brand stores can also help you manage multiple models. Learn more about AI at the edge with open source.
ML systems are compelling targets for malicious actors, but that fact shouldn’t hold you back from innovating with AI. By developing a strong understanding of the threat landscape, implementing best practices, and taking advantage of open source solutions, you can protect your models and data and enjoy the benefits of AI/ML without putting your organisation at risk.