What Is High Availability?
High availability is a term often used in computing, in connection with a system architecture or a service, to indicate that the architecture or service has an adequate level of availability.
Availability is now a key concern for infrastructure. It is estimated that the unavailability of an IT service can cost millions, particularly in industry, where the shutdown of a production line can be devastating.
Two complementary methods are used to improve availability:
- Setting up a dedicated physical infrastructure, generally based on hardware redundancy. This produces a high-availability cluster (as opposed to a computing cluster): a cluster of computers whose goal is to provide a service while avoiding downtime.
- Setting up appropriate processes to reduce errors and speed up recovery when an error occurs. ITIL contains many such processes.
Availability is often measured as a percentage made up mostly of 9s (figures below assume a 365-day year):
- 99% means that the service is unavailable for less than 3.65 days per year
- 99.9%, less than 8.76 hours per year
- 99.99%, less than 52.6 minutes per year
- 99.999%, less than 5.26 minutes per year
- 99.9999%, less than 31.6 seconds per year
- 99.99999%, less than 3.2 seconds per year, etc.
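As a rough illustration, the yearly downtime budget for a given percentage can be computed by multiplying the unavailable fraction by the length of the year. The sketch below assumes a 365-day year; the function name `max_downtime_seconds` is hypothetical and chosen only for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap, 365-day year


def max_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for an availability in percent."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        seconds = max_downtime_seconds(pct)
        print(f"{pct}% -> {seconds:.1f} s/year ({seconds / 3600:.2f} h)")
```

Dividing the result by 60, 3600 or 86400 gives the budget in minutes, hours or days.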
High availability and disaster recovery are often wrongly conflated. They are two different but complementary tasks, both needed to achieve continuous availability.
Techniques for improving availability
Many techniques are used to improve availability:
- Hardware redundancy and clustering;
- Securing data: RAID, snapshots, Oracle Data Guard, BCV (Business Copy Volume), SRDF (Symmetrix Remote Data Facility), DRBD;
- The ability to reconfigure the server "hot" (that is, while it is running);
- A degraded mode or a panic mode;
- A contingency plan;
- Securing backups: off-site storage, centralization at a third-party site.
High availability usually requires suitable premises: power supply, under-floor air conditioning with particle filters, a maintenance service, a security service, and protection against theft and malicious acts. The risks of fire and water damage must also be considered. Power and communication cables must be redundant and buried. They should not be exposed in the building's underground parking garage, which is too often the case in buildings in Paris. These criteria are the first to consider when choosing a hosting provider (when renting high-availability premises).
For each level of the architecture, for each component, and for each connection between components, the following must be established:
- How is a failure detected? Examples: liveness tests (TCP health check) implemented by an Alteon appliance, a test program invoked periodically (heartbeat), a diagnostic interface on the components, etc.
- How is the component secured, made redundant, or backed up? Examples: backup server, system cluster, WebSphere clustering, RAID storage, backups, dual-attached SAN, degraded mode, unused spare hardware ready to be installed.
- How should the switch to backup or degraded mode be triggered: manually after analysis, or automatically?
- How to ensure that the backup system restarts from a stable and known state. Examples: starting from a copy of the database and reapplying the archive logs, restarting batches from a known state, two-phase commit for transactions that update several data stores, etc.
- How the application restarts on the backup mechanism. Examples: restarting the application, restarting interrupted batches, activating a degraded mode, having the backup server take over the failed server's IP address, etc.
- How in-flight transactions or sessions are resumed. Examples: session persistence on the application server, a mechanism to answer a client about a transaction that completed successfully before the failure but for which the client never received a response, etc.
- How to return to the nominal situation:
  - If a degraded mode allows transactions to be stored in a file while a database is down, how are those transactions reapplied when the database becomes active again?
  - If a failed component has been deactivated, how is it reintroduced into active service (e.g., the need to resynchronize data, retest the component, etc.)?
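The failure-detection question above can be illustrated with a minimal sketch: a TCP health check that simply tries to open a connection, and a small detector that only declares a failure after several consecutive missed probes. The names `tcp_health_check` and `FailureDetector`, and the threshold of three misses, are assumptions made for this example; they do not describe how any particular load balancer implements its checks.

```python
import socket
from typing import Callable


def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure, ...
        return False


class FailureDetector:
    """Declare a target failed after `threshold` consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3):
        self.probe = probe
        self.threshold = threshold
        self.misses = 0

    def poll(self) -> bool:
        """Run one probe; return True while the target is still considered alive."""
        if self.probe():
            self.misses = 0  # any success resets the counter
        else:
            self.misses += 1
        return self.misses < self.threshold
```

Calling `poll()` at a fixed interval gives a simple heartbeat. Requiring several consecutive misses avoids switching to the backup on a single lost probe, at the cost of a slower detection.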
Dependency on other applications
For an application that calls other applications synchronously through middleware (HTTP web service, Tuxedo, CORBA, EJB), its availability rate is strongly tied to the availability of the applications it depends on. The sensitivity of the applications it depends on must therefore be equal to or greater than the sensitivity of the application itself. Otherwise, consider:
- Using asynchronous middleware: MQ Series, JMS, SonicMQ, CFT
- Implementing a degraded mode for when an application we depend on is failing.
For this reason, asynchronous middleware, which favors good availability, should be preferred whenever possible.
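A minimal sketch of such a degraded mode, assuming an in-memory queue stands in for an asynchronous middleware or a local file: messages that cannot be delivered are kept and replayed in order once the dependency answers again. The class `StoreAndForward` and the `send` callable are hypothetical names for this illustration, and the dependency is assumed to signal unavailability by raising `ConnectionError`.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForward:
    """Queue messages while a dependency is down and replay them in order later."""

    def __init__(self, send: Callable[[str], None]):
        self._send = send  # assumed to raise ConnectionError when the dependency is down
        self._pending: Deque[str] = deque()

    @property
    def pending(self) -> int:
        return len(self._pending)

    def submit(self, message: str) -> bool:
        """Queue the message, then try to deliver everything; True if nothing is left."""
        self._pending.append(message)
        return self.flush()

    def flush(self) -> bool:
        """Replay queued messages in order, stopping at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # still down: keep the message for a later retry
            self._pending.popleft()  # remove only after a successful send
        return True
```

Because a message is removed only after `send` returns, a crash between the send and the removal could deliver it twice; a real system would need idempotent receivers or a persistent, transactional queue.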
Load distribution and sensitivity
Sensitivity is often handled by making elements redundant behind a load-balancing mechanism (for example, a WebSphere cluster with an Alteon load balancer). For this system to offer a real gain in reliability, check that if one element fails, the remaining elements have enough capacity to provide the service.
In other words, with two active load-balanced servers, a single server must be able to carry the entire load. With three servers, a single server must be able to carry 50% of the load (assuming that the probability of an incident on two servers at the same time is negligible). For good reliability, it is not necessary to have many servers backing each other up. For example, an element that is 99% reliable, made redundant once, gives a reliability of 99.99% (the probability that both elements fail at the same time is 1/100 × 1/100 = 1/10,000).
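The arithmetic in this paragraph can be sketched as follows, assuming failures are independent and the load is spread evenly. The function names are hypothetical and chosen only for this example.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent redundant elements, each available `single` (0..1).

    The group is down only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1 - (1 - single) ** copies


def required_capacity_share(servers: int, tolerated_failures: int = 1) -> float:
    """Fraction of the total load each server must be able to carry
    so that the surviving servers still handle everything."""
    survivors = servers - tolerated_failures
    if survivors < 1:
        raise ValueError("no server would survive")
    return 1 / survivors


if __name__ == "__main__":
    print(combined_availability(0.99, 2))   # two 99% elements
    print(required_capacity_share(2))       # two servers, one may fail
    print(required_capacity_share(3))       # three servers, one may fail
```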
The redundancy of an element is usually achieved with multiple identical components. To be effective, this assumes that the failure of one component is random and independent of the failure of the others. This is the case for hardware failures, for example.
This is not true of all failures. For example, an operating system failure or a software component malfunction can occur on all components at once when conditions are favorable. For this reason, when the application is extremely sensitive, we consider making elements redundant with components of different kinds that perform the same functions. This can lead to:
- Choosing servers of different kinds, with different operating systems and different infrastructure software products;
- Developing the same component twice, each time respecting the interface contracts that apply to the component.
Redundancy with voting system
In this mode, several components process the same inputs and therefore (in principle) produce the same output.
The outputs of all components are collected, and an algorithm is then applied to produce the final result. The algorithm can be simple (majority vote) or complex (mean, weighted mean, median, etc.). The aim is to eliminate erroneous results caused by a malfunction of one of the components and/or to make the result more reliable by combining several slightly different results. This approach:
- Does not allow load balancing
- Introduces the problem of the reliability of the component that runs the voting algorithm
This method is commonly used in the following cases:
- Systems based on sensors (e.g., temperature sensors) in which the sensors are redundant
- Systems where several different components performing the same function are used (see Differential redundancy) and for which a better result can be obtained by combining the components' outputs (e.g., a pattern recognition system using multiple algorithms to achieve a better recognition rate).
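Two of the voting algorithms mentioned above can be sketched briefly: a strict majority vote for discrete outputs and a median for numeric readings such as redundant sensors. The function names are hypothetical, and the choice to raise an error when no strict majority exists is an assumption of this example rather than a rule from the text.

```python
from collections import Counter
from statistics import median
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of components."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among component outputs")
    return value


def median_vote(readings: Sequence[float]) -> float:
    """Combine numeric readings; the median ignores a single wild outlier."""
    return median(readings)


if __name__ == "__main__":
    print(majority_vote(["ok", "ok", "fail"]))   # the faulty component is outvoted
    print(median_vote([20.1, 20.3, 85.0]))       # the 85.0 reading is discarded
```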
After a redundant component has malfunctioned and been repaired, we may want to reintroduce it into active service and verify that it works correctly, but without its results being used. In this case, the inputs are processed by one (or more) components known to be reliable. These produce the result used by the rest of the system. The same inputs are also processed by the reintroduced component, which is said to be in "shadow" mode. Its proper functioning can be checked by comparing its results with those produced by the trusted components. This method is often used in voting-based systems, because it is enough to exclude the component in "shadow" mode from the final vote.
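A minimal sketch of shadow mode, assuming components are plain callables: the trusted component's output is the only one returned, while the reintroduced component runs on the same input and its disagreements are merely counted. `ShadowRunner` and its attributes are hypothetical names for this illustration.

```python
from typing import Any, Callable


class ShadowRunner:
    """Run a reintroduced component in shadow next to a trusted one."""

    def __init__(self, trusted: Callable[[Any], Any], shadow: Callable[[Any], Any]):
        self.trusted = trusted
        self.shadow = shadow
        self.total = 0
        self.mismatches = 0

    def process(self, item: Any) -> Any:
        result = self.trusted(item)  # the only output the rest of the system sees
        self.total += 1
        try:
            shadow_result = self.shadow(item)
        except Exception:
            # A crash in the shadow component must never affect the live path.
            self.mismatches += 1
        else:
            if shadow_result != result:
                self.mismatches += 1
        return result
```

Once `mismatches` stays at zero over enough inputs, the component can be promoted back to active service; how many inputs count as "enough" is a policy decision outside this sketch.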
Processes that improve availability
These processes play two distinct roles.
Processes that reduce the number of failures
Since prevention is better than cure, putting control processes in place that reduce the number of incidents on the system improves availability. Two processes can play this role:
- The change management process: 60% of errors are related to a recent change. A formalized process, accompanied by adequate tests (carried out in a proper pre-production environment), can eliminate many incidents.
- A proactive error management process: incidents can often be detected before they occur, for example when response times increase. A process dedicated to this task, equipped with adequate tools (measurement system, reporting, etc.), can intervene even before the incident happens.
By implementing these processes, many incidents can be avoided.
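As an illustration of proactive detection, the sketch below raises a warning when the rolling mean of recent response times exceeds a multiple of a known baseline. The class name, the window of five samples and the factor of 2 are assumptions chosen for this example, not recommended values.

```python
from collections import deque
from statistics import mean


class ResponseTimeMonitor:
    """Warn when the rolling mean response time drifts above baseline * factor."""

    def __init__(self, baseline_ms: float, factor: float = 2.0, window: int = 5):
        self.limit_ms = baseline_ms * factor
        self.samples = deque(maxlen=window)  # keeps only the most recent samples

    def record(self, response_ms: float) -> bool:
        """Add one measurement; return True if a warning should be raised."""
        self.samples.append(response_ms)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.limit_ms
```

Averaging over a window keeps a single slow request from triggering a warning, while a sustained increase does.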
Processes that reduce the duration of outages
A failure always ends up happening. When it does, the recovery process is essential for restoring the service as quickly as possible. This process must have one goal: to let the user use the service again as quickly as possible. A definitive repair should be avoided at this stage because it takes much longer. The process should therefore put a workaround in place.
High-availability cluster
A high availability cluster (as opposed to a computing cluster) is a cluster of computers whose goal is to provide a service while avoiding downtime.
Source: Wikipedia, the free encyclopedia. The text is available under a Creative Commons license.