One thing we want to prevent in your private cloud at all times is downtime of your cloud workloads. That is why we build in architectural safeguards such as redundancy and cloud-aware or cloud-native application development; this way, if one component in the infrastructure fails, other parts take over and your users never know the difference. But what if your application is not fully cloud aware yet? Or new instances of it cannot be orchestrated? For those cases you want a high availability solution.
High availability is the continuous availability of systems in the face of component failures. The goal is to minimize workload downtime. It’s important to realize that redundancy alone is not enough to achieve high availability: it also requires fault tolerance, the ability of your workload to continue its service without interruption on the redundant component.
Compare it to the multi-tyre axles on a truck versus the spare tyre on a 4×4. If one of the tyres on the truck blows, the others can still get you home safely, where you can fix it: they take over the load for a while, without the truck even having to stop for a replacement. Starting with more tyres than the minimum needed is called redundancy. During normal operations the extra tyres can even help divide the load, and when a tyre fails, the remaining tyres automatically take over all the load without the truck having to stop for emergency maintenance. The 4×4, on the other hand, also carries a redundant tyre, but the car has to stop immediately when one tyre fails: unless you are very skilled at changing tyres and willing to take enormous risks, you will need to fit the spare tyre in emergency maintenance alongside the road before you can continue running your workload. Switching to the redundant component causes downtime, or at least an interruption of service.
A cloud-native application architecture provides fault tolerance at the application level, keeping processes running while redundant components are switched off and on.
In a different article I discussed ‘Ceph Storage’, which provides highly available cloud storage using fault-tolerant software mechanisms and multiple redundant disks, spread over multiple nodes and multiple clusters. Later this year I will discuss how to create high availability for ‘Open Networking’ with Cumulus.
As OpenStack is designed especially for cloud-native or cloud-aware workloads, it does not come with a high availability solution for compute processes. But now that OpenStack is widely used, also for workloads that are not fully cloud native, a few very interesting projects can help you with that. In an earlier article I discussed ‘Disruptive Hybrid Cloud Management and Cloud Migration’, which describes how we move workloads between different clouds with only minimal interruption; this mechanism can also be used to provide high availability for non-cloud-native workloads with minimal interruptions. In this article I will share some information about another solution that automatically recovers failed instances.
Workload high availability with Masakari
One of the various options to provide HA-like functionality in OpenStack that we use extensively is Masakari.
Masakari is an OpenStack add-on that provides high availability services for OpenStack clouds by automatically recovering so-called failed instances. It recovers virtual machines from failure events such as a VM process being down, a provisioning process being down, and nova-compute host failure. Masakari also provides an API service to manage and control the automated rescue mechanism.
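To make those three failure classes concrete, here is a small illustrative sketch (not part of Masakari itself) that maps them to the notification types Masakari works with and assembles a notification body like the one its monitors send to the API. The helper function and failure-class labels are hypothetical; the `type` values and field names follow the Masakari notifications API.

```python
# Illustrative mapping of the failure classes described above to the
# notification "type" values reported to masakari-api. The dict keys and
# the helper below are hypothetical; the type values mirror Masakari's.
FAILURE_TO_NOTIFICATION_TYPE = {
    "vm_process_down": "VM",                 # the instance's VM process died
    "provisioning_process_down": "PROCESS",  # e.g. the nova-compute process died
    "compute_host_down": "COMPUTE_HOST",     # the whole hypervisor node failed
}

def build_notification(failure_class, hostname, generated_time, payload=None):
    """Assemble a notification body resembling what is POSTed to masakari-api."""
    if failure_class not in FAILURE_TO_NOTIFICATION_TYPE:
        raise ValueError(f"unknown failure class: {failure_class}")
    return {
        "notification": {
            "type": FAILURE_TO_NOTIFICATION_TYPE[failure_class],
            "hostname": hostname,
            "generated_time": generated_time,
            "payload": payload or {},
        }
    }
```

A host monitor that detects a dead hypervisor would send something like `build_notification("compute_host_down", "compute-1", "<timestamp>")` to the API, which hands it to the engine over RPC.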
OpenStack High Availability goals
Before getting into how the architecture is set up, let’s have a look at some high availability goals of the cloud infrastructure.
1. Control plane availability; you want to make sure that the existing cloud resources are not affected by failures of infrastructure components.
2. 100% data plane availability; you want to ensure that no disruption to existing applications and/or VMs takes place.
3. Prompt recovery of services in the event of disruption; you want to make sure that you are monitoring the cloud infrastructure so failures are detected quickly.
4. Monitoring and alerting; make sure you are alerted early enough so you have time to start recovery rapidly.
To reach these goals with Masakari, a combination of Corosync and Pacemaker can be used. Pacemaker resource agents control the DRBD devices (the block device layer) via Corosync and are therefore able to organize high availability of the nodes in an OpenStack environment. Corosync makes sure that the virtual IP is always available on one of the nodes.
OpenStack Masakari creates a cluster of servers, detecting and reporting failure of hosts in the cluster. This is where Corosync comes in: the virtual IP lives on one of the nodes, and when that node fails, Corosync makes sure one of the other nodes takes over the virtual IP.
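As a sketch, a minimal Corosync cluster definition for two such nodes could look like the fragment below. The cluster name, node names and addresses are placeholders, and the exact options depend on your Corosync version and network; treat it as an illustration, not a drop-in configuration.

```
# /etc/corosync/corosync.conf: illustrative two-node example
totem {
    version: 2
    cluster_name: ha-compute        # placeholder name
    transport: udpu                 # unicast UDP between the nodes
}

nodelist {
    node {
        ring0_addr: 192.0.2.11      # compute-1 (example address)
        nodeid: 1
    }
    node {
        ring0_addr: 192.0.2.12      # compute-2 (example address)
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1                     # relax quorum rules for a two-node cluster
}
```

On top of such a cluster, a Pacemaker resource agent (for example `ocf:heartbeat:IPaddr2`) is what actually keeps the virtual IP bound to exactly one live node.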
The Masakari service consists of the following components:
- masakari-api: An OpenStack-native REST API that processes API requests by sending them to the masakari-engine over Remote Procedure Call (RPC).
- masakari-engine: Processes the notifications received from masakari-api by executing the recovery workflow asynchronously.
- masakari-monitors: Provides the detection side of Masakari’s Virtual Machine High Availability (VMHA) service for OpenStack clouds by automatically detecting failure events such as a VM process being down, a provisioning process being down, and nova-compute host failure. When it detects such an event, it sends a notification to the masakari-api.
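The flow through these components can be summarized with a hedged, much-simplified sketch of the dispatch masakari-engine performs: each notification type triggers a different kind of recovery workflow. The action labels below are illustrative names, not Masakari’s internal task names; the real engine runs full recovery workflows and, for host failures, honours the failover segment’s configured recovery method (such as ‘auto’ or ‘reserved_host’).

```python
# Simplified, illustrative dispatch of notification type -> recovery
# workflow. The workflow labels are hypothetical; they only summarize
# what the engine's real workflows accomplish at a high level.
RECOVERY_WORKFLOWS = {
    "VM": "stop_and_restart_instance",     # restart the failed VM
    "PROCESS": "disable_compute_service",  # keep the scheduler off the broken node
    "COMPUTE_HOST": "evacuate_instances",  # rebuild VMs on healthy hosts
}

def recovery_workflow(notification_type):
    """Return the (illustrative) workflow name for a notification type."""
    try:
        return RECOVERY_WORKFLOWS[notification_type]
    except KeyError:
        raise ValueError(f"unsupported notification type: {notification_type}")
```

Seen this way, masakari-monitors detects and reports, masakari-api accepts and forwards, and masakari-engine decides and executes.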
Redundancy is one of the solutions to keep your platform and workloads available when one component fails. Although it still requires recovery of the failed instance, downtime is minimized. With the integration of Masakari, Pacemaker and Corosync within OpenStack, you take major steps to further enhance the high availability of the cloud infrastructure and of your workloads on it. If you have any questions about Masakari or the other components described, I am happy to hear from you.