What Is High Availability?
High availability is a system design approach and associated implementation that ensures a certain degree of operational continuity during a given measurement period. Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, the system is said to be unavailable. The term downtime refers to periods when the system is not available.
Planned (Scheduled) and Unplanned (Unscheduled) Downtime
Typically, planned downtime is the result of maintenance that disrupts system operation and usually cannot be avoided with the currently installed system design. Events that cause planned downtime include software patches that require a reboot and configuration changes that only take effect after a reboot. In general, planned downtime is the result of a logical event initiated by management.
Unplanned downtime typically arises from a physical event such as a hardware failure or an environmental anomaly. Examples include power failures, failed CPU or RAM components, shutdowns caused by overheating, logically or physically severed network connections, security breaches, and failures of the operating system, applications, or middleware.
Many computing sites exclude planned downtime from their availability calculations, assuming, rightly or wrongly, that planned downtime has little or no impact on the user community. By excluding planned downtime, many systems can claim phenomenally high availability, which gives the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and expensive. Their designs are carefully implemented to eliminate single points of failure and to allow hardware, network, operating system, middleware, and application upgrades, patches, and replacements to be performed online.
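The effect of excluding planned downtime can be illustrated with a small calculation. The following is a minimal sketch; the maintenance and outage figures are invented for illustration and do not come from any real system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 in a non-leap year


def availability_percent(downtime_minutes: float) -> float:
    """Return availability as a percentage of a non-leap year."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


# Hypothetical figures for one year of operation (assumptions, not measurements).
planned_minutes = 12 * 120   # assumed: one two-hour maintenance window per month
unplanned_minutes = 45       # assumed: a single 45-minute unplanned outage

excluding_planned = availability_percent(unplanned_minutes)
including_planned = availability_percent(planned_minutes + unplanned_minutes)

print(f"Excluding planned downtime: {excluding_planned:.3f}%")
print(f"Including planned downtime: {including_planned:.3f}%")
```

With these assumed numbers the same system reports roughly 99.99% when planned downtime is left out and roughly 99.72% when it is counted, which is the gap the paragraph above warns about.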
Techniques For Improving Availability
Many techniques are used to improve availability:
- Redundant hardware and clustering;
- Data protection: RAID, snapshots, BCV (Business Continuance Volume), Oracle Data Guard, SRDF (Symmetrix Remote Data Facility), DRBD;
- The ability to reconfigure the system “hot” (that is, while it is running);
- A degraded (“limp”) mode or a panic mode;
- A disaster recovery plan;
- Secured backups: off-site storage, centralization at a third-party site.
Two complementary means are also used to improve availability:
- The establishment of a dedicated physical infrastructure, generally based on hardware redundancy. This creates a high-availability cluster (as opposed to a computing cluster): a group of computers whose goal is to provide a service while avoiding downtime.
- The establishment of appropriate processes to reduce errors and to accelerate recovery when an error occurs. ITIL contains many such processes.
Estimated high availability percentage
Availability is often expressed as a percentage made up mostly of nines:
- 99% means that the service is unavailable less than 3.65 days per year
- 99.9%, less than 8.76 hours per year
- 99.99%, less than 52.6 minutes per year
- 99.999%, less than 5.26 minutes per year
- 99.9999%, less than 31.6 seconds per year
- 99.99999%, less than 3.2 seconds per year
- Etc.
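The arithmetic behind such a list is simply the unavailable fraction multiplied by the length of a year. Here is a minimal sketch, assuming a 365-day year; the function names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 in a non-leap year


def downtime_budget_seconds(nines: int) -> float:
    """Maximum yearly downtime, in seconds, for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines gives 0.001 (99.9% available)
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds: float) -> str:
    """Format a duration using the largest unit that keeps the value at 1 or more."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.2f} seconds"


for nines in range(2, 8):
    percent = 100.0 * (1.0 - 10.0 ** (-nines))
    budget = humanize(downtime_budget_seconds(nines))
    # Show as many decimals as needed to display every nine.
    print(f"{percent:.{nines - 2}f}%: at most {budget} per year")
```

Each additional nine divides the yearly downtime budget by ten, so the step from three nines to five nines shrinks the budget from hours to minutes.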
Availability is usually expressed as a percentage of operating time in a given year. The minutes of unplanned downtime recorded for a system during the year are added up, and this total is divided by the number of minutes in a year (about 525,600) to give the percentage of downtime. Its complement is the percentage of operating time, which is what we call availability. Common availability values for highly available systems, typically stated as a number of “nines”, are:
- 99.9% = 43.8 minutes / month or 8.76 hours / year (“three nines”)
- 99.99% = 4.38 minutes / month or 52.6 minutes / year (“four nines”)
- 99.999% = 0.44 minutes / month or 5.26 minutes / year (“five nines”)
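The calculation described above (sum the recorded downtime, divide by the minutes in a year, take the complement) can be sketched as follows. The outage log is hypothetical and the helper names are assumptions made for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 in a non-leap year

# Availability tiers from the list above, strictest first.
TIERS = [("five nines", 99.999), ("four nines", 99.99), ("three nines", 99.9)]


def availability_from_outages(outage_minutes):
    """Availability (%) for one year, given the length of each unplanned outage."""
    downtime_percent = 100.0 * sum(outage_minutes) / MINUTES_PER_YEAR
    return 100.0 - downtime_percent  # availability is the complement of downtime


def tier_met(availability):
    """Return the strictest tier label the availability satisfies, or None."""
    for label, threshold in TIERS:
        if availability >= threshold:
            return label
    return None


# Hypothetical outage log for one year, in minutes per incident.
outages = [12.0, 3.5, 30.0]
result = availability_from_outages(outages)
print(f"{result:.4f}% -> {tier_met(result)}")
```

With 45.5 minutes of total downtime, this example stays inside the 52.6-minute yearly budget for four nines but well outside the 5.26-minute budget for five nines.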
It should be noted that uptime and availability are not synonymous. A system may be running and yet not be available, as in the case of a network outage. In practice, these availability figures appear mostly in sales and marketing documents rather than as fully measurable and quantifiable technical specifications.
Measurement and interpretation
Clearly, the measurement of availability is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year may have been cut off by a network failure that lasted 9 hours during a peak usage period; the user community will regard the system as having been unavailable, whereas the system administrator will claim 100% “uptime.” Following the true definition of availability, however, the system was available approximately 99.897% of the time (8,751 hours out of the 8,760 hours in a non-leap year).
Likewise, systems experiencing performance problems are often regarded by users as wholly or partially unavailable, while administrators may have a different (and probably wrong, certainly in the business sense) perception. Similarly, the unavailability of selected application functions may go unnoticed by administrators yet be devastating for users; a true measure of availability is holistic.
Availability must be measured to be determined, ideally with comprehensive monitoring tools (“instrumentation”) that are themselves highly available. Where instrumentation is lacking, systems that support high-volume transaction processing throughout the day and night, such as credit card processing systems and telephone switches, are often inherently better monitored, at least by the users themselves, than systems that experience periodic lulls in demand.
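To make the difference between uptime and user-visible availability concrete, here is a minimal sketch of how monitoring samples might be turned into both figures. The `Probe` record format and the simulated data are assumptions for illustration; a real monitoring tool would collect these samples itself.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One monitoring sample (hypothetical record format)."""
    minute: int           # minutes since the start of the measurement period
    host_up: bool         # did the machine itself report as running?
    user_reachable: bool  # could a user-level request complete?


def percent_true(flags):
    """Share of True values among the samples, as a percentage."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags)


# Simulated day sampled once a minute: the host stays up all day,
# but users cannot reach it for 90 minutes (minutes 600 to 689).
probes = [Probe(m, True, not (600 <= m < 690)) for m in range(24 * 60)]

uptime = percent_true(p.host_up for p in probes)
availability = percent_true(p.user_reachable for p in probes)

print(f"uptime: {uptime:.2f}%  availability: {availability:.2f}%")
# prints "uptime: 100.00%  availability: 93.75%"
```

An administrator looking only at `host_up` would report a perfect day, while the user-level probe shows the outage that users actually experienced.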
High Availability Related Concepts
Recovery time is closely related to availability: it is the total time required for a planned outage, or the time required to recover fully from an unplanned outage. Recovery time can be infinite with certain system designs and failures, in which case full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary data center for disaster recovery.
Another related concept is data availability, which is the degree to which databases and other information storage systems accurately record and report system transactions. Information management specialists often focus separately on data availability in order to determine the acceptable (or actual) data loss under various failure events. Some users can tolerate interruptions in application service but cannot tolerate data loss.