ICT-Hotlist Topic

By Johan van Soest.

Published : 2016-12-09.

Last updated : 2017-05-02.

Business Continuity Management Example: selection process failure preparation.

Picture 1 uses the same decision arguments (Business Impact and Critical Time span) as Picture 1 in the article "Business Continuity Management Example: selection process disaster / calamity preparation." Here these arguments are applied to the situation in which one of the server systems fails at the primary location. Your organization may of course use other or additional decision arguments.

Picture 1. Example: selection process failure preparation.

Possible solutions

Virtualisation

When you own virtualization technology such as VMware or Hyper-V, additional opportunities open up for high availability of the server systems. When you deploy multiple identical, redundant hardware servers, a virtual server can be moved (using vMotion or "Live Migration") to another server with sufficient capacity when a server failure is impending or hardware maintenance is due. Because virtualization makes it easy to add virtual server systems quickly, spare capacity tends to be used up unnoticed; one must ensure that enough capacity remains to absorb a failure.
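
The spare-capacity requirement can be checked with a small calculation. The following is a minimal sketch, not a vendor tool: it assumes memory is the only constraint, and the host names, VM names and sizes are hypothetical. For every host it simulates a failure and tries to re-place the displaced virtual servers on the remaining hosts, largest first. The placement is a simple greedy heuristic, so a "False" result means "needs a closer look" rather than "certainly impossible".

```python
from typing import Dict


def survives_host_failure(host_capacity_gb: Dict[str, int],
                          vm_demand_gb: Dict[str, int],
                          placement: Dict[str, str]) -> Dict[str, bool]:
    """For each host, report whether its VMs fit on the remaining hosts."""
    result = {}
    for failed in host_capacity_gb:
        # Free memory on every surviving host under the current placement.
        free = {h: cap for h, cap in host_capacity_gb.items() if h != failed}
        for vm, host in placement.items():
            if host != failed:
                free[host] -= vm_demand_gb[vm]
        # VMs that ran on the failed host, largest memory demand first.
        displaced = sorted((vm for vm, h in placement.items() if h == failed),
                           key=lambda vm: vm_demand_gb[vm], reverse=True)
        ok = True
        for vm in displaced:
            # Greedy choice: the surviving host with the most free memory.
            target = max(free, key=free.get) if free else None
            if target is None or free[target] < vm_demand_gb[vm]:
                ok = False
                break
            free[target] -= vm_demand_gb[vm]
        result[failed] = ok
    return result


# Hypothetical inventory: three identical hosts with 64 GB each.
hosts = {"host-a": 64, "host-b": 64, "host-c": 64}
vms = {"vm1": 16, "vm2": 16, "vm3": 24, "vm4": 48}
placement = {"vm1": "host-a", "vm2": "host-a", "vm3": "host-b", "vm4": "host-c"}

for host, ok in survives_host_failure(hosts, vms, placement).items():
    print(host, "can fail safely" if ok else "has NO spare capacity elsewhere")
```

In this example the failure of host-c cannot be absorbed, because no remaining host has 48 GB free. Running such a check whenever a virtual server is added helps to keep the spare capacity from eroding.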

Planning type	Description
No Spare	Having no spare server system can be a valid choice when thorough analysis shows that the daily business depends only minimally on this server system. Replacement parts or systems are purchased only after a failure occurs. The main advantage of this strategy is the low preparation cost.
Bilateral/Shared arrangements	In a hosting situation, for example, you can fall back on spare (virtual) machines that your hoster keeps "on the shelf" to solve its own system failures. You can also agree with another organization to maintain a spare server system together and share the costs. Calamity service providers rely on this principle to manage costly systems efficiently, counting on the wide regional spread of their customers. Neither party to a shared arrangement may use these spare resources for normal production environments.
Back-up Location	If you have a secondary or backup site, you can decide to move an installed server system from there to the primary location, configure it, load the latest data and start it. It is also worth analysing whether this one server system service can run directly from the secondary / backup site. Of course you should synchronise the updates of both (production) systems.
Spare system	Spare systems are fully configured at the primary location and wait to be activated after the failure of a production server system. There must be a good procedure for updating the spare systems, as they cannot be online together with the production systems.
Fault tolerance/HA	Fault-tolerant systems are the fastest solutions to deploy when a failure occurs. By building redundancy into systems (multiple power supplies, network cards, disk RAID levels, SAN) or building active-active clusters, the risk of losing the server systems through failures can be significantly reduced. With this kind of solution one should be alert for Single Points of Failure (SPoF) that undermine the redundancy and can bring down an HA server system.
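
Looking for Single Points of Failure can start from a simple inventory. The sketch below is only an illustration under stated assumptions: the component roles and instance names are hypothetical, and a role counts as redundant as soon as it has at least two distinct instances. A real analysis also has to consider shared dependencies such as a common power feed or a single storage array behind two nodes.

```python
from typing import Dict, List


def single_points_of_failure(components: Dict[str, List[str]]) -> List[str]:
    """Return the roles that are covered by fewer than two distinct instances."""
    return sorted(role for role, instances in components.items()
                  if len(set(instances)) < 2)


# Hypothetical inventory of one HA server system.
inventory = {
    "power supply": ["psu-1", "psu-2"],
    "network card": ["nic-1", "nic-2"],
    "cluster node": ["node-1", "node-2"],
    "SAN switch": ["switch-1"],      # only one switch
    "storage array": ["array-1"],    # shared by both cluster nodes
}

for role in single_points_of_failure(inventory):
    print("SPoF:", role)
```

Here the two cluster nodes look redundant, but both depend on the same SAN switch and storage array, so the system as a whole is not.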

Glossary

Calamity:

A calamity mostly arises from an incident of external origin that spreads across multiple hardware systems, software systems or even employees. Such incidents include:

Fire
Economic boycotts
Epidemics
Hacking
Floods
Sabotage
Software errors
Storm damage
Failing software updates

A failure can, of course, also result in a calamity. For example, a server that runs too hot for a long time and reports "overheating" as a failure can eventually cause a calamity such as a fire.

Cluster:

A cluster is a server system built from multiple servers in the same location that act as one server system; the service is offered in such a way that the user believes they are communicating with just one server system. Clusters are made up of nodes, and each node is a server system. There are two types of clusters. In an active-passive cluster, one node handles the user requests while the passive nodes monitor whether the active node is still functioning properly, so that one of them can take over when it fails. An active-active cluster has multiple active nodes that process the user requests. An active-active cluster should of course be sized so that, when a node fails, the total processing power of the remaining nodes is still high enough to keep the users pleased.
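
The monitoring role of a passive node can be illustrated with a heartbeat timeout. This is a minimal sketch under assumptions, not the mechanism of any particular cluster product: the class name, the timeout value and the use of plain timestamps in seconds are all hypothetical. The passive node records each heartbeat it receives from the active node and promotes itself when the active node has been silent for longer than the timeout.

```python
from dataclasses import dataclass


@dataclass
class PassiveNode:
    timeout_s: float                # hypothetical: silence tolerated before takeover
    last_heartbeat_s: float = 0.0   # time of the most recent heartbeat
    active: bool = False            # becomes True once this node has taken over

    def on_heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the active node."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        """Take over if the active node has been silent longer than the timeout."""
        if not self.active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = True
        return self.active


node = PassiveNode(timeout_s=3.0)
node.on_heartbeat(10.0)      # active node is alive at t = 10 s
print(node.check(12.0))      # 2 s of silence: stay passive
print(node.check(14.0))      # 4 s of silence: take over
```

Production clusters add safeguards that this sketch leaves out, for example measures to prevent both nodes from becoming active at the same time when only the network link between them fails.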

Primary-location:

The computer centre or room housing the systems that process and store information on a daily basis.

Production-location:

Refer to primary location.

Secondary-location:

This term is interchangeable with the term "fail-over location". This is the location where the Business Continuity systems are prepared to take over the workload in case of a calamity.

Server system:

Complete but limited combination of software and hardware to support a task for a user. For example: a file storage service consisting of a server with many hard drives and an operating system for file storage and an application for version-control.

Failure:

In short: a failure is a problem related to a program or a hardware component. Disturbances/failures can be solved with redundancy and high availability (HA) systems.

Fail-over-location:

Refer to secondary location.
