Design of a high availability system
Paradoxically, adding more components to the overall system can undermine efforts to achieve high availability: complex systems inherently have more potential failure points and are more difficult to implement correctly.
Techniques for improving availability
Many techniques are used to improve availability:
- Redundant hardware and clustering;
- Data security: RAID, snapshots, Oracle Data Guard, BCV (Business Copy Volume), SRDF (Symmetrix Remote Data Facility), DRBD;
- The ability to reconfigure “hot” (that is, while the system is running);
- A degraded or panic mode;
- A disaster recovery plan;
- Secure backups: off-site storage, centralization at a third-party site.
High availability also requires suitable hosting: power supply, underfloor air conditioning with particulate filtering, a maintenance service, a security service, and protection against malicious acts and theft. Attention must also be paid to the risk of fire and water damage. Power and communication cables must be redundant and buried; they should not run exposed through the building's underground garage. These criteria are the first to consider when choosing a hosting provider (when renting a high-availability facility).
For each layer of the architecture, each component, and each connection between components, the following must be established:
- How is a failure detected? Examples: TCP health checks implemented by an Alteon appliance, periodic test programs (heartbeat), a “diagnostic” interface on the components, etc.
- How is the component secured, made redundant, backed up, etc.? Examples: backup server, system cluster, WebSphere clustering, RAID storage, backups, dual SAN attachment, degraded mode, unused spare hardware ready to be reinstalled.
- How should the switch to the backup or degraded mode be triggered: manually after analysis, or automatically?
- How is it ensured that the backup system restarts from a stable, known state? Examples: starting from a copy of the database and reapplying the archive logs, restarting batch jobs from a known state, two-phase commit for transactions updating multiple data repositories, etc.
- How does the application restart on the backup system? Examples: application restart, restart of interrupted batch jobs, activation of a degraded mode, takeover of the failed server's IP address by the backup server, etc.
- How are in-flight transactions and sessions recovered? Examples: session persistence on the application server, a mechanism ensuring that a client receives a response for a transaction that completed successfully before the failure but for which the client never got an answer, etc.
- How is the nominal situation restored? Examples:
  - if a degraded mode stores transactions in a file while a database is down, how are those transactions re-applied once the database becomes active again?
  - if a failed component has been deactivated, how is it reintroduced into active service (e.g., data resynchronization, retesting the component, etc.)?
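The failure-detection step in the checklist above can be sketched as a simple heartbeat monitor, in the spirit of the periodic test programs (heartbeat) mentioned. The class name and thresholds below are illustrative, not a real product's API:

```python
import time

class HeartbeatMonitor:
    """Flags a component as failed after max_missed missed heartbeats."""

    def __init__(self, interval_s, max_missed=3):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_beat = time.monotonic()

    def beat(self):
        """Called by (or on behalf of) the monitored component on each heartbeat."""
        self.last_beat = time.monotonic()

    def is_failed(self, now=None):
        """True once max_missed intervals have elapsed without a beat."""
        if now is None:
            now = time.monotonic()
        return (now - self.last_beat) > self.max_missed * self.interval_s

monitor = HeartbeatMonitor(interval_s=1.0, max_missed=3)
monitor.beat()
print(monitor.is_failed())  # False: a beat was just received
# Simulate 5 s of silence: more than 3 missed 1 s intervals -> failed
print(monitor.is_failed(now=monitor.last_beat + 5.0))  # True
```

In practice this detection is delegated to infrastructure (load-balancer health checks, cluster managers); the point is that failure is declared only after several missed checks, to avoid reacting to a single lost probe.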
Dependence on other applications
For an application that calls other applications through synchronous middleware (HTTP web services, Tuxedo, CORBA, EJB), its availability rate will be strongly tied to the availability of the applications it depends on. The availability requirement of the applications depended upon must therefore be equal to or greater than that of the application itself. Otherwise, consider:
- Using asynchronous middleware: MQ Series, JMS, SonicMQ, CFT;
- Implementing a degraded mode when an application depended upon fails.
For this reason, the use of asynchronous middleware is recommended wherever possible, as it favors good overall availability.
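The decoupling benefit of asynchronous middleware can be sketched with a simple in-memory queue standing in for a product like MQ Series or JMS (the class and method names are illustrative): the caller keeps accepting work even while the downstream application is unavailable, and pending messages are delivered once it comes back.

```python
from collections import deque

class AsyncChannel:
    """In-memory stand-in for a message queue (MQ Series, JMS, ...)."""

    def __init__(self):
        self.pending = deque()

    def send(self, message):
        # The sender never blocks on the receiver's availability.
        self.pending.append(message)

    def drain(self, consumer_up):
        """Deliver queued messages only while the consumer is up."""
        delivered = []
        while consumer_up and self.pending:
            delivered.append(self.pending.popleft())
        return delivered

channel = AsyncChannel()
channel.send("order-1")  # downstream application is down...
channel.send("order-2")  # ...but the caller is not blocked
print(channel.drain(consumer_up=False))  # [] : nothing delivered yet
print(channel.drain(consumer_up=True))   # ['order-1', 'order-2']
```

A real middleware adds persistence and delivery guarantees, but the availability property is the same: the calling application's availability no longer depends directly on the called application's.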
Load balancing and sensitivity
Sensitive services are often handled by redundant elements behind a load-balancing mechanism (a WebSphere cluster behind an Alteon load balancer, for example). For this setup to offer a real gain in reliability, verify that if one element fails, the remaining elements have enough capacity to carry the service.
In other words, in the case of two active servers with load balancing, the capacity of a single server must cover the entire load. With three servers, the capacity of a single server must cover 50% of the load (assuming the probability of an incident on two servers at the same time is negligible). To improve reliability, it is unnecessary to have many servers backing each other up: for example, an element that is 99% reliable, made redundant once, gives a reliability of 99.99% (the probability that both elements fail at the same time is 1/100 × 1/100 = 1/10,000).
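The arithmetic above can be checked directly: assuming independent failures, the unavailability of n redundant elements is the product of their individual unavailabilities. A minimal sketch (the function name is illustrative):

```python
def combined_availability(per_element, n):
    """Availability of n redundant elements with independent failures:
    the system is down only if all n elements are down at once."""
    unavailability = 1.0 - per_element
    return 1.0 - unavailability ** n

# One element that is 99% reliable, made redundant once (two elements):
print(round(combined_availability(0.99, 2), 4))  # 0.9999 -> 99.99%
# Probability that both fail at the same time: 1/100 * 1/100 = 1/10,000
print(round((1 - 0.99) ** 2, 6))  # 0.0001
```

Note that this product rule is exactly what the independence assumption in the next paragraph buys; correlated failures (a shared software bug, for instance) break it.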
Redundancy is usually achieved by deploying several identical components. To be effective, this assumes that a failure of one component is random and independent of failures of the other components, which is the case, for example, for hardware failures.
This is not the case for all failures: a flaw in the operating system or a software defect in a component can occur on all components at once when conditions are favorable. For this reason, when the application is extremely sensitive, redundancy is built from components of different natures that fulfill the same functions. This can lead to:
- Choosing servers of different kinds, with different operating systems and different infrastructure software products;
- Developing the same component twice, each time respecting the contracts that apply to the component's interface.
Redundancy with a voting system
In this mode, several components process the same inputs and therefore (in principle) produce the same outputs.
The results produced by all the components are collected, and an algorithm then produces the final result. The algorithm can be simple (majority vote) or complex (mean, weighted mean, median, etc.); the aim is to eliminate erroneous results caused by a malfunction of one component and/or to make the result more reliable by combining several slightly different results. This process:
- Does not allow load balancing;
- Introduces the problem of the reliability of the component managing the voting algorithm.
This method is commonly used in the following cases:
- Systems based on sensors (e.g., temperature sensors) where the sensors are redundant;
- Systems where several different components performing the same function are used (see differential redundancy above) and where a better outcome can be achieved by combining the results produced by the components (e.g., a pattern-recognition system using multiple algorithms for a better recognition rate).
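A minimal sketch of the voting step, using a strict majority vote over the outputs of redundant components (the function name is illustrative):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of components,
    or None if no value reaches a majority (vote inconclusive)."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Three redundant temperature sensors; one is malfunctioning.
print(majority_vote([21.5, 21.5, 87.0]))  # 21.5: the erroneous reading is eliminated
print(majority_vote([1, 2, 3]))           # None: no majority, must escalate
```

For noisy analog sensors a median or mean is usually preferred over exact-match voting; the inconclusive case (None) is precisely where the reliability of the voting component itself becomes critical.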
After a redundant component has malfunctioned and been repaired, one may wish to reintroduce it into active service and verify its correct operation before its results are actually used. In this case, inputs are processed by one (or more) components known to be reliable; these produce the result used by the rest of the system. The same inputs are also processed by the reintroduced component, which is said to run in “shadow” mode. Its correct operation can be verified by comparing its results with those of the reliable components. This method is often used in voting-based systems, since it suffices to exclude the “shadow” component from the final vote.
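The reintroduction step can be sketched as follows: the shadow component processes the same inputs as a trusted one, its output is compared but never used, and it is declared ready only after enough consecutive matching results. All names and the match threshold below are illustrative assumptions:

```python
class ShadowReintroduction:
    """Run a repaired component in 'shadow' mode: its results are compared
    with a trusted component's but never used downstream, until it has
    matched for required_matches consecutive inputs."""

    def __init__(self, trusted, shadow, required_matches=100):
        self.trusted = trusted
        self.shadow = shadow
        self.required_matches = required_matches
        self.consecutive_matches = 0

    def process(self, value):
        result = self.trusted(value)      # this result is actually used
        if self.shadow(value) == result:  # shadow output: compared only
            self.consecutive_matches += 1
        else:
            self.consecutive_matches = 0  # any mismatch restarts the check
        return result

    def ready_for_active_service(self):
        return self.consecutive_matches >= self.required_matches

# Toy example: both components compute an absolute value.
pipeline = ShadowReintroduction(trusted=abs, shadow=abs, required_matches=3)
for v in (-1, 2, -3):
    pipeline.process(v)
print(pipeline.ready_for_active_service())  # True after 3 matching results
```

Any mismatch resets the counter, so the component is promoted only after an uninterrupted run of agreement with the trusted one.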
Processes that improve availability
These processes play two distinct roles:
Processes that reduce the number of failures
Based on the principle that prevention is better than cure, implementing control processes that reduce the number of incidents on the system also improves availability. Two processes can play this role:
- The change-management process: 60% of errors are related to a recent change. Implementing a formalized process, accompanied by adequate testing (carried out in a proper pre-production environment), eliminates many incidents.
- A proactive error-management process: incidents can often be detected before they occur (response times increase, etc.). A process dedicated to this task, equipped with adequate tools (monitoring, reporting, etc.), can intervene even before the incident happens.
By implementing these processes, many incidents can be avoided.
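The proactive detection idea above can be sketched as a rolling-average check on response times: raise an alert when recent measurements drift well above an agreed baseline, before users experience an outage. The class name, the 2x factor, and the window size are illustrative assumptions:

```python
from collections import deque
from statistics import mean

class ResponseTimeWatch:
    """Alert when the rolling average response time exceeds
    factor times the agreed baseline."""

    def __init__(self, baseline_ms, factor=2.0, window=5):
        self.baseline_ms = baseline_ms
        self.factor = factor
        self.samples = deque(maxlen=window)

    def record(self, response_ms):
        """Record one measurement; return True if an alert should fire."""
        self.samples.append(response_ms)
        return mean(self.samples) > self.factor * self.baseline_ms

watch = ResponseTimeWatch(baseline_ms=100.0, factor=2.0, window=3)
print(watch.record(110.0))  # False: close to the 100 ms baseline
print(watch.record(150.0))  # False: average 130 ms, under the 200 ms threshold
print(watch.record(400.0))  # True: average 220 ms, an incident is brewing
```

Averaging over a window rather than alerting on single samples avoids false alarms on isolated slow requests, which is the point of a pro-active process: act on trends, not spikes.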
Processes that reduce the duration of outages
Breakdowns always happen eventually. At that point, the error-recovery process is essential: it determines how quickly the service is restored. Its goal must be to let users use the service again as quickly as possible; a permanent repair should be avoided at this stage, because it takes much longer. The process should therefore put a workaround in place.
High availability cluster
A high availability cluster (as opposed to a computing cluster) is a cluster of computers whose goal is to provide a service whilst avoiding downtime.
Study: From Wikipedia, the free encyclopedia. The text is available under the Creative Commons license.