Optimising Ubuntu performance on amd64 architecture

Everyone wants the Linux distribution they are using to be fast. This is practically a content-free statement, of course: who would want their distro to be slow?

But at the same time, what does it mean for your distribution to be fast? For example, Ubuntu 21.10 switched the default compression for packages to zstd, which made them faster to both download and decompress, improving the performance of one important operation on Ubuntu. But of course there are many other aspects of performance, and this article is about something very different: the processor features Ubuntu assumes are available.

In this post, I will talk a little about the history of the amd64 architecture and some investigations we are doing in collaboration with Intel to make better use of newer processors.

Background

By far and away the most used architecture for Ubuntu is amd64, also known as x86-64 in some contexts. Ubuntu is still built for the very first amd64 CPUs, the AMD K8 from 2003 and Intel’s 64-bit Prescott from 2004, using the original instruction set architecture (ISA).

Over the years, Intel and AMD have added a number of extensions to the ISA, for example:

  • SIMD: SSE3, SSE4, AVX, AVX512, etc
  • Special purpose: RDRAND, AES-NI, VNNI
  • Slightly more general: cmpxchg16b (atomic compare and exchange), vfmadd* (fused multiply-add for floating point), movbe (byte order conversion)

Not using these new instructions to improve performance throughout the distribution seems like a missed opportunity. A few core packages like glibc and openssl do runtime detection to use newer instructions when they are available but vastly more packages do no such thing.

A significant difference between an architecture like amd64 and, say, POWER is the diversity of implementations. The POWER architecture has been extended over the years in several ways, but a processor from 2018 can reasonably be assumed to support every instruction supported by a processor from 2013. This is not at all true for the amd64 world. For example, SSE4.1 was introduced in the Penryn microarchitecture in 2007, but as late as 2012 designs that did not support it (e.g. the Centerton range of Atoms) were still being released. In addition, both AMD and Intel have introduced extensions that the other has eventually implemented (as well as extensions that never really became widely used and eventually disappeared, such as 3DNow! and TSX).

For a long time, the dynamic loader (part of glibc) has allowed distributions to take some advantage of newer CPU features by searching extra directories when support for these features is detected. On amd64, however, versions of glibc prior to 2.33 based these additional directories on ad-hoc, poorly defined selections of capabilities. For example, to my knowledge /lib/x86_64-linux-gnu/haswell was searched on most Intel processors since 2014, but on no AMD ones at all.

In 2020, the glibc developers, particularly Florian Weimer of Red Hat, got sufficiently fed up with this mess to propose a solution on the libc-alpha mailing list: assemble reasonable sets of CPU features into “levels” that are mostly supported together, and have the dynamic loader search directories based on these names.
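The idea of searching directories named after levels can be pictured with a minimal sketch. This is a simplified model, not glibc's actual code: the function name hwcaps_search_dirs is hypothetical, and the sketch assumes the layout used by glibc 2.33 and later, where each library directory may contain a glibc-hwcaps subdirectory with one directory per level (using the level names that were eventually adopted, described next).

```python
import posixpath

# Levels above the baseline, lowest first. The baseline has no
# subdirectory of its own: baseline libraries live in the library
# directory itself.
LEVELS = ["x86-64-v2", "x86-64-v3", "x86-64-v4"]


def hwcaps_search_dirs(lib_dir, supported_level):
    """Return the directories to try, most specific first.

    Simplified model (an assumption of this sketch): a CPU that supports
    a level also supports every lower level, so all of those
    subdirectories are candidates, followed by lib_dir as the fallback.
    """
    if supported_level in LEVELS:
        usable = LEVELS[: LEVELS.index(supported_level) + 1]
    else:
        usable = []  # baseline-only CPU or unknown level name
    dirs = [posixpath.join(lib_dir, "glibc-hwcaps", level)
            for level in reversed(usable)]
    return dirs + [lib_dir]
```

For a machine that supports v3, hwcaps_search_dirs("/usr/lib/x86_64-linux-gnu", "x86-64-v3") lists the x86-64-v3 subdirectory first, then x86-64-v2, then the plain library directory, so a package can ship an optimised copy of a library alongside the baseline one.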

Some bikeshedding later, four levels were defined, each including the previous: “v1” (the baseline), “v2”, “v3” and “v4”. These definitions were added to the “psABI” specification (roughly speaking, the document that defines what binary code for an amd64 Linux system looks like):

Level name    CPU feature    Example instruction
(baseline)    CMOV           cmov
              CX8            cmpxchg8b
              FPU            fld
              FXSR           fxsave
              MMX            emms
              OSFXSR         fxsave
              SCE            syscall
              SSE            cvtss2si
              SSE2           cvtpi2pd
x86-64-v2     CMPXCHG16B     cmpxchg16b
              LAHF-SAHF      lahf
              POPCNT         popcnt
              SSE3           addsubpd
              SSE4_1         blendpd
              SSE4_2         pcmpestri
              SSSE3          phaddd
x86-64-v3     AVX            vzeroall
              AVX2           vpermd
              BMI1           andn
              BMI2           bzhi
              F16C           vcvtph2ps
              FMA            vfmadd132pd
              LZCNT          lzcnt
              MOVBE          movbe
              OSXSAVE        xgetbv
x86-64-v4     AVX512F        kmovw
              AVX512BW       vdbpsadbw
              AVX512CD       vplzcntd
              AVX512DQ       vpmullq
              AVX512VL       n/a

Reference: page 14 of the psABI

As alluded to above, it’s not really possible to say that a processor from a given era supports a given level, but as a rough guide most processors from 2009 onward support v2 and most processors from 2015 on support v3.
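Since the release year is only a rough guide, the reliable way to find out what a particular machine supports is to ask it. The following is a minimal sketch that does so by reading the flags line of /proc/cpuinfo on Linux. The function names are hypothetical, and the mapping from psABI feature names to the kernel's flag spellings (for example SSE3 as "pni" and LZCNT as "abm") is an assumption of this sketch rather than something the psABI defines.

```python
# Flag spellings as they appear in /proc/cpuinfo on Linux (assumed mapping).
# OSXSAVE is approximated by the "xsave" flag, and OSFXSR by "fxsr".
LEVEL_FLAGS = [
    ("x86-64-v1", {"cmov", "cx8", "fpu", "fxsr", "mmx", "syscall",
                   "sse", "sse2"}),
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1",
                   "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma",
                   "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd", "avx512dq",
                   "avx512vl"}),
]


def highest_level(cpu_flags):
    """Return the highest level whose flags, and those of every lower
    level, are all present, or None if even the baseline is incomplete."""
    flags = set(cpu_flags)
    best = None
    for name, required in LEVEL_FLAGS:
        if not required <= flags:
            break  # levels are cumulative, so stop at the first gap
        best = name
    return best


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag set of the first CPU listed in a cpuinfo-style file."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux machine, highest_level(read_cpu_flags()) gives the level name. On systems with glibc 2.33 or later, running the dynamic loader itself with --help should also report which glibc-hwcaps levels it considers supported.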

v4 is complicated: Intel’s 11th Gen processors support it but the 12th Gen and 13th Gen processors do not, while AMD’s new Zen 4 microarchitecture adds support. It’s hard to know what the future holds for AVX512 and I’m not going to consider it for the rest of this article.

From glibc to the toolchains

Although the original idea of these levels was to rationalise the process by which the dynamic loader looks for shared libraries, they also provide a sensible label for a set of instructions assumed to be available by all parts of the distribution. Support for using “x86-64-v$N” as values for the -march flag was added to GCC in version 11 and LLVM in version 12.
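One way to see what a given -march value changes is to ask the compiler which feature macros it predefines. The sketch below is an illustration under stated assumptions: it expects a gcc (version 11 or later) on the PATH, relies on the standard -dM -E preprocessor options, and reads the empty input /dev/null, so it is Unix-specific. The helper names parse_defines and isa_macros are hypothetical.

```python
import subprocess


def parse_defines(preprocessor_output):
    """Collect macro names from '#define NAME VALUE' lines."""
    names = set()
    for line in preprocessor_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define":
            names.add(parts[1])
    return names


def isa_macros(march, compiler="gcc"):
    """Return the macros the compiler predefines for -march=<march>.

    Assumes the compiler accepts the given -march value; a failing
    compiler run raises subprocess.CalledProcessError.
    """
    result = subprocess.run(
        [compiler, f"-march={march}", "-dM", "-E", "-x", "c", "/dev/null"],
        capture_output=True, text=True, check=True,
    )
    return parse_defines(result.stdout)
```

With a suitable compiler installed, the set difference isa_macros("x86-64-v3") - isa_macros("x86-64") should contain feature macros such as __AVX2__ and __FMA__, which is what lets code and the optimiser assume those instructions are present.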

It is worth noting here that we are only really talking about the C and C++ toolchains in this document. While the distribution clearly contains a great and increasing amount of code in other languages (Python, Go, Rust, Java, Ruby, …), a large majority of the code is in C/C++. For some language ecosystems, in particular Python, a lot of the performance sensitive code is in C/C++ anyway (e.g. numpy). The other statically compiled toolchains (like Rust and Go) do have support for selecting the precise ISA they target but for the rest of this document we will only think about C and C++.  

Bumping the baseline?

It is a trivial change to the packaging of GCC to change the default value for -march, and some distributions have already made this change: both RHEL 9 and openSUSE Tumbleweed (as of Nov 2022) target x86-64-v2.

These changes have both a cost and a benefit:

  • For users that have hardware that is too old to support v2 instructions, these operating systems will not work at all.
  • For users that have paid for better hardware, these operating systems take better advantage of that hardware.

For a commercial distribution like RHEL, this probably still makes sense: if you are spending the money to get a RHEL (or SLES or …) license you are probably already running reasonably up-to-date hardware, or at least the additional cost of updating to hardware that is less than 10 years old is fairly insignificant. It is interesting to note that SUSE’s new “Adaptable Linux Platform” product originally proposed targeting v3 and later scaled this back to v2.

For a free distribution like Ubuntu (or Fedora), the calculation is different: allowing users to extend the life of hardware by installing a free Linux distribution is a significant, positive aspect of the open source world, and it is very likely that the users who are still using 2008-era hardware with Ubuntu are the users who are least able to upgrade.

That said, hardware doesn’t last forever. A few years ago, the cost of maintaining full support for 32-bit x86 machines started to outweigh the benefits and we stopped building most packages. Making a considered decision here requires data. Specifically:

  • Usage – How many Ubuntu users are using hardware that supports only v1 or v2?
  • Performance – How much performance improvement does changing the default to x86-64-v2 or x86-64-v3 bring anyway?

Neither of these questions is easy to answer.

Trying it for yourself

While we continue our own performance analysis and further assess the needs of our users, we have released an experimental Ubuntu 23.04 Server build – using -march=x86-64-v3 and -mtune=icelake-server – for the community to try out. As we consider the potential perks and drawbacks of using v3 system-wide, your feedback and observations will be an invaluable part of the process. Here are some of the questions we have for our own efforts:

  • On aggregate, is the v3 version faster than the baseline v1 version of Ubuntu? As we mentioned above, this can mean a lot of different things, from quantitative benchmarking to a looser, qualitative feeling about speed from the user perspective.
  • Are there certain domains where performance overwhelmingly benefits from or regresses because of v3?
  • Do these changes break anything?

This Discourse post explains where to find an installer for this build. The installer is not only built out of the rebuilt packages, but will also install packages from the rebuild archive by default. Please note that this is for testing only. Systems installed using this installer will receive no security (or any other) updates and will be in no way suitable for use in production.

We will be making another post when our own benchmarking is complete to explain what we tested and the results we found. Stay tuned!

Ubuntu Server Admin
