A study from ManageForce estimated the cost of a database outage to be an average of $474,000 per hour. Long database outages are the result of poor design concerning high availability.
With the exponential growth of data generated over the internet (which is expected to reach 180 zettabytes by the end of 2025) and the increasing reliance on different database technologies to serve that data to its intended users, the cost of database downtime will continue to increase in the upcoming years.
This blog will first introduce the main concepts around availability. Then, we will list some of the patterns used to provide highly available database deployments, and finish by explaining how Canonical solutions help you deploy highly available applications.
Before going into the details of how to achieve high availability for databases, let’s ensure that we have a common understanding of a few concepts.
Availability is a measure of the uptime of a given service over a period of time. It can be understood as the opposite of downtime. For example, a monthly availability of 99.95% implies a maximum downtime of about 22 minutes per month.
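To make the arithmetic behind that figure concrete, here is a minimal sketch that converts an availability target into a downtime budget. The function name and the 30-day month are my own assumptions for illustration; they do not come from any particular tool.

```python
def max_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the maximum downtime, in minutes, allowed by an availability target.

    Assumes a 30-day month by default (an illustrative choice, not a standard).
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The downtime budget is the fraction of the period NOT covered by the target.
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        budget = max_downtime_minutes(target)
        print(f"{target}% availability allows {budget:.1f} minutes of downtime per 30-day month")
```

For a 99.95% target over 30 days, the budget comes out at roughly 21.6 minutes, which matches the "about 22 minutes" quoted above.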
Durability is a measure of the ability of a given system to preserve data against certain failures (e.g. hardware failure). For example, a yearly durability of 99.999999999% implies that you might lose one object per year for every 100 billion objects you store!
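The same kind of back-of-the-envelope calculation works for durability. The sketch below uses Python's decimal module, because subtracting eleven nines from 1 with ordinary floats loses precision. The function name is hypothetical, and the result is only an expected average under the simplifying assumption that objects are lost independently.

```python
from decimal import Decimal


def expected_annual_loss(stored_objects: int, durability_percent: str) -> Decimal:
    """Return the expected number of objects lost per year.

    durability_percent is passed as a string (e.g. "99.999999999") so that
    Decimal can represent it exactly.
    """
    loss_fraction = 1 - Decimal(durability_percent) / 100
    return stored_objects * loss_fraction


if __name__ == "__main__":
    lost = expected_annual_loss(100_000_000_000, "99.999999999")
    print(f"Expected objects lost per year: {lost.normalize()}")
```

With 100 billion stored objects and eleven nines of durability, the expected loss is one object per year.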
Note that data availability and data durability are quite different. You might be unable to access the data during a database outage but you still expect the persisted data to be reachable when the database is up again.
In the rest of this blog, I will use instance to refer to the compute part of a database deployment in a given host. The instance is the interface to which the database client connects.
Besides, I will use the term database to refer to the storage part, essentially the data “files”, managed by the associated instance(s).
Note that the instance and the database can reside in different hosts.
High availability is typically achieved using redundancy and isolation constructs. Redundancy is implemented by duplicating some of the database components. Isolation is achieved by placing the redundant components in independent hosts.
The term cluster refers to the entirety of the components of a database deployment, including its redundant ones. Together, these components ensure the availability of the solution.
Let’s explore some of the clustering patterns in the next section.
As we saw earlier, there are 2 main parts of a database deployment that we can make redundant to achieve high availability:
In this type of clustering, we protect the database instance part by deploying several instances in different hosts. The database resides, typically, on a remote storage visible to all the concerned hosts.
We can have 2 types of instance-level clustering:
In this type of clustering, we protect both the database instance and the database. Protecting only the database might result in a situation where your data is protected but there is no way to access it quickly.
Note that we can create offline copies of the database for backup purposes. Yet, the main purpose, in such a case, is data protection.
The high availability of the database is achieved through replication. We can distinguish 2 types of replication depending on the layer performing it:
We use the term replicas to denote the additional database(s) resulting from the replication mechanism.
The primary database/instance is the one receiving the client’s write traffic.
Shared-nothing database deployments are composed of independent servers, each having its own dedicated memory, compute and storage. They tend to provide higher availability as they allow us to leverage all the isolation constructs we will see in the next section.
Isolation is about reducing the impact radius of a given failure/disaster event on your cluster components. The more distant your redundant components are, the less likely that all of them will fail simultaneously.
This is the most elementary form of isolation. Placing redundant components in different servers prevents a failure in a network card, an attached storage device or a CPU from impacting all of your redundant components.
A rack is a standardised enclosure to mount servers and various other electronic equipment. The servers hosted in the same rack might share a number of elements like network switches and power cables. Placing your redundant components into servers hosted in different racks will prevent a failure on one of the rack-shared components from impacting all of your deployment.
Typically, all the servers hosted in a given data centre share power and cooling infrastructures. Using several data centres to host your database deployment will make it resilient towards a broader range of events, like power failures and data centre-wide maintenance operations.
Public cloud providers popularised the concept of “availability zone”. It consists of one or more data centres that are, geographically speaking, close to each other.
Using several availability zones to host your database deployments might protect your services from some “natural” disasters like fires and floods.
We can go one step further in terms of isolation and use several regions for our database deployments. This kind of set-up can protect your database from major disasters like storms, volcanic eruptions and even political instability (think about transferring your workload from a war zone to another region).
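The isolation levels described above (server, rack, data centre, availability zone, region) can be audited with a few lines of code. The following is a minimal sketch, assuming you keep an inventory of where each redundant component runs. The Placement class and every identifier in it are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Failure domains ordered from the narrowest to the widest.
LEVELS = ("server", "rack", "data_centre", "availability_zone", "region")


@dataclass(frozen=True)
class Placement:
    """Where one redundant component of the cluster is hosted (hypothetical model)."""
    server: str
    rack: str
    data_centre: str
    availability_zone: str
    region: str


def shared_failure_domains(placements: Sequence[Placement]) -> List[str]:
    """Return the levels at which ALL components sit in the same failure domain.

    Each returned level is a single point of failure for the whole cluster.
    """
    return [
        level
        for level in LEVELS
        if len({getattr(p, level) for p in placements}) == 1
    ]


if __name__ == "__main__":
    cluster = [
        Placement("srv-1", "rack-a", "dc-1", "az-1", "region-1"),
        Placement("srv-2", "rack-b", "dc-1", "az-1", "region-1"),
    ]
    print(shared_failure_domains(cluster))
```

In this example the two components are isolated at the server and rack levels, but they still share a data centre, an availability zone and a region, so a data centre-wide power failure would take both down.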
Now that we have a good overview of how to make a given database deployment resilient to certain failure events, we need to make sure that we can automatically leverage its resiliency.
Fail-over is the process by which we transfer ownership of a database service from a faulty server to a healthy one.
A fail-over can be initiated manually by a human or automatically by a component of the database deployment. Relying on human intervention might result in longer downtime than an automated fail-over, so it should be avoided where possible.
In order to automatically initiate a fail-over from a primary instance/database to a replica, a component should monitor the state of each of the instances and decide to which healthy one the service should be transferred.
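The decision step of such a monitoring component can be sketched as follows. This is a simplified illustration under my own assumptions, not how any specific fail-over tool works: the InstanceState fields are hypothetical, health is reduced to a boolean, and the healthy replica with the smallest replication lag is preferred. Real tools also have to guard against situations such as two instances both believing they are the primary, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceState:
    """Snapshot of one instance as seen by the monitor (hypothetical model)."""
    name: str
    is_primary: bool
    healthy: bool
    replication_lag_seconds: float = 0.0


def choose_failover_target(instances: Sequence[InstanceState]) -> Optional[str]:
    """Return the name of the replica to promote, or None.

    None means either that the primary is healthy (no fail-over needed)
    or that no healthy replica is available (no fail-over possible).
    """
    primary = next((i for i in instances if i.is_primary), None)
    if primary is not None and primary.healthy:
        return None
    candidates = [i for i in instances if not i.is_primary and i.healthy]
    if not candidates:
        return None
    # Prefer the replica that is closest to the primary's last known state.
    return min(candidates, key=lambda i: i.replication_lag_seconds).name


if __name__ == "__main__":
    cluster = [
        InstanceState("db-0", is_primary=True, healthy=False),
        InstanceState("db-1", is_primary=False, healthy=True, replication_lag_seconds=4.0),
        InstanceState("db-2", is_primary=False, healthy=True, replication_lag_seconds=0.5),
    ]
    print(choose_failover_target(cluster))
```

Here the primary db-0 is down, so the service would be transferred to db-2, the healthy replica with the least lag.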
Here is a list of fail-over tools for some of the popular databases:
If you are planning to use such tools then you need to perform a series of tests (both on primary and replicas) to ensure that the cluster behaviour matches your expectations:
Depending on the behaviour of the chosen solution, you might need to implement your own customisation to meet your requirements.
An important aspect, often overlooked, in database high availability is the capacity of the application to promptly (re)connect to the healthy database instances following a server fail-over.
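On the application side, one common way to reconnect promptly is to try each known endpoint in turn and retry with a growing delay. The sketch below is deliberately driver-agnostic: connect is any callable you supply that opens a connection or raises ConnectionError, and the endpoint names, attempt count and delays are illustrative assumptions rather than recommended values.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], T],
    attempts: int = 5,
    initial_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try every endpoint in order, retrying the whole list with exponential backoff.

    `connect` is supplied by the caller (for example, a thin wrapper around
    your database driver) and must raise ConnectionError on failure.
    """
    delay = initial_delay
    last_error: Optional[ConnectionError] = None
    for attempt in range(attempts):
        for endpoint in endpoints:
            try:
                return connect(endpoint)
            except ConnectionError as exc:
                last_error = exc
        if attempt < attempts - 1:
            sleep(delay)  # Give the fail-over some time before the next round.
            delay *= 2
    raise ConnectionError(
        f"no endpoint reachable after {attempts} attempts"
    ) from last_error


if __name__ == "__main__":
    def fake_connect(endpoint: str) -> str:
        # Stand-in for a real driver call: only the second host is up.
        if endpoint != "db-2.example.internal":
            raise ConnectionError(f"{endpoint} is down")
        return f"connected to {endpoint}"

    print(connect_with_failover(["db-1.example.internal", "db-2.example.internal"], fake_connect))
```

Passing sleep as a parameter keeps the function easy to test, and keeping the retry budget small bounds the time the application spends waiting before it reports an error.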
In order for your applications to fail-over quickly to the healthy instances, you might need to:
As we saw during the previous section, ensuring database high availability involves a considerable amount of work to design, test and maintain several components and configurations.
In the next section, we will talk about the Canonical offerings that can help you reach your high availability goals.
Juju is Canonical’s answer to automatically managing complex applications involving any number of technologies, including databases.
We provide a curated list of charmed Juju operators for a wide range of databases with built-in high availability and fail-over automation.
Moreover, Juju’s unique ability to express relations between various workloads helps you in ensuring, for example, that your application will always target a healthy database instance.
Juju helps DevOps, DBAs and SREs in quickly deploying, maintaining and upgrading applications in a holistic fashion.
The Juju ecosystem allows its users to retain a high degree of customization and freedom:
Please contact us to learn more about Juju and our solutions to achieve high availability for your workloads.