Data science is one of the most exciting topics of the last century. With its utility in industries of all kinds, it’s easy to see why it has been rated as one of the top 20 fastest-growing occupations in the US, according to the Bureau of Labor Statistics. However, entering this fast-growing space isn’t easy: newcomers face significant challenges in setting up their environments, dealing with package dependencies, and accessing compute resources. Given these obstacles, it is no surprise that a talent shortage persists in the data science field, or that overcoming these challenges is vital for teams and companies.
This blog will walk you through the most common challenges that data science newcomers face, review popular data science platforms, and take a look at the bigger picture of how open source is used in data science. With these insights, you will be able to more easily choose the right tools and options to simplify your work and focus on upskilling in the data science field.
Data science is a rewarding career, but getting started can be challenging. Here are the most common obstacles that new data scientists face early in their careers:
As you can see, new data scientists typically face a rough start. However, the good news is that once they are on track, it gets easier and easier every day.
As I mentioned before, it seems that a new tool, framework or library for data science or machine learning is launched every other day. This can be overwhelming. How do you actually choose from this wide variety of options?
Before we get into the weeds of tools, let’s take a moment to look at the main capabilities and key considerations that a data science platform should have:
Scalability: While many AI projects start small, every data scientist should also have a long-term vision and consider how well a platform scales. A scalable platform lets data scientists keep using the same tools as the project matures, without needing to upskill in new ones.
Now that we have an idea of what to look for in a data science tool or platform, let’s take a closer look at the popular options that data scientists use.
Returning to the preference for open source, we should look at the entire stack and how open source tooling can accelerate the whole process. Linux has pioneered the open source space, with Ubuntu being the most adopted distribution. It has a powerful command line that data scientists and machine learning engineers enjoy using, and it simplifies their operational tasks. Open source has much more to offer on the data science journey. Python is a great example: it’s the preferred programming language in data science, and many of its libraries, such as pandas, NumPy, PyTorch, and TensorFlow, have been widely adopted in countless data science projects.
But how do you actually build the models? In the Stack Overflow report we mentioned above, Jupyter Notebook is listed as one of the top technologies used in data science. It is a powerful tool for performing many data science or machine learning tasks, including cleaning data, building ML pipelines, and training models. In the same area, MLflow, which is used for experiment tracking and as a model registry, hit 10 million users over a year ago, a sign of how widely open source tooling has been adopted. Such a platform is often deployed on a workstation with a GPU, which must also be configured. NVIDIA, for example, has a GPU operator that streamlines the experience for cloud-native applications.
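Before configuring a GPU on a workstation, it can help to confirm that the driver tooling can see one at all. The following is a minimal sketch, not part of any tool mentioned above: the helper name `list_nvidia_gpus` is hypothetical, and it assumes the NVIDIA driver's `nvidia-smi` utility is what you want to query. It returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def list_nvidia_gpus(tool="nvidia-smi"):
    """Return the GPU names reported by the driver tool, or [] if unavailable.

    Hypothetical helper for illustration; it only reads information
    and does not change any configuration.
    """
    path = shutil.which(tool)
    if path is None:
        # The tool is not on PATH, so no driver tooling is visible.
        return []
    try:
        result = subprocess.run(
            [path, "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but could not report any devices.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_nvidia_gpus()
    if gpus:
        for name in gpus:
            print(f"Found GPU: {name}")
    else:
        print("No NVIDIA GPU visible to the driver tooling.")
```

An empty result does not necessarily mean that the machine has no GPU. It may simply mean that the driver is not installed yet.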
These are just some examples of the tools that one can use. Once they are selected, data scientists need to integrate them into a cohesive solution. Each tool pulls in a series of packages with their own dependencies and versioning constraints. Users need to coordinate all of this to keep the platform working, especially across upgrades and updates, which can break a previously working setup.
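To make the versioning problem concrete, here is a minimal sketch of a check that compares installed Python packages against minimum versions. It uses only the standard library. The function names and the example minimums are assumptions made for illustration, and the version parsing is deliberately simplistic (numeric parts only), so a real project would rely on a proper dependency resolver instead.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.1.0rc1' into (2, 1, 0), keeping only leading numeric parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_older(found, minimum):
    """Compare two versions after padding them to the same length."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_minimums(minimums, installed=None):
    """Return {package: problem} for every minimum that is not met.

    `installed` is an optional {package: version} mapping; when omitted,
    versions are looked up in the current Python environment.
    """
    problems = {}
    for name, minimum in minimums.items():
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None
        if found is None:
            problems[name] = "not installed"
        elif is_older(found, minimum):
            problems[name] = f"found {found}, need >= {minimum}"
    return problems


if __name__ == "__main__":
    # Example minimums chosen for illustration only.
    wanted = {"numpy": "1.24", "pandas": "2.0"}
    for name, problem in check_minimums(wanted).items():
        print(f"{name}: {problem}")
```

Running a check like this after every upgrade can surface a broken environment early, although it only covers minimum versions and says nothing about conflicts between packages.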
Given these initial challenges, data scientists should look for tools that address most of them at the lowest possible cost. The Data Science Stack (DSS) is a solution provided by Canonical that puts together leading open source tools that cover part of the machine learning lifecycle, enabling users to develop, optimise and store models without hefty start-up costs, time-consuming setup or difficult configurations.
Data Science Stack (DSS) is an out-of-the-box solution for data scientists and machine learning engineers, published by Canonical. It is a ready-made environment for ML enthusiasts that enables them to develop and optimise models without spending time on the necessary tooling. It is designed to run on any Ubuntu AI workstation, maximising the GPU’s capability and simplifying its usage. Are you curious?
DSS includes leading open source tools, such as Jupyter Notebook and MLflow, with full integration. By default, it ships with two of the most adopted ML images, PyTorch and TensorFlow. They can be deployed using an intuitive command line interface (CLI), after which the tools’ UIs can be accessed to dive into data science.
Beyond giving access to an ML solution, DSS also takes care of package dependencies, ensuring that all the tools, libraries and frameworks work seamlessly together and are compatible with the machine’s hardware. In addition, DSS simplifies GPU configuration by including the GPU operator and all the benefits that come with it.
It is available in beta, inviting data scientists, machine learning engineers, and AI enthusiasts to share their feedback with us. You can easily deploy it on your Ubuntu machine, tell us about your experience, and benefit from this ongoing community feedback.
If you would like to learn more about data science tools, join our webinar on [date]. Together with Michal Hucko, we will talk about: