Mitigating Single Points of Failure

What could go wrong usually does.

Olivier Reuland
September 25, 2023

Availability

We need our services to be available for our clients, our processes to work reliably, and our staff to be on hand when we need them. But Edward A. Murphy Jr. is here to remind us that things don't always go as planned.

Anything that can go wrong will go wrong.

Edward A. Murphy Jr., Murphy's First Law

Failures in systems create risks to availability. One way to mitigate them is to add redundancy: if a component fails, another one is there to compensate. This can be done for any component of a system, whether it is people, processes, or technology.

👥 People

Someone executing a task.

📃 Process

A process which is followed to execute a task.

⚙️ Technology

A technology component, for example, a server or a service, which executes a task.

Luckily, we can do something about this. We can design our systems with defence in depth, adding layers of controls so that if one fails, the others will save the day. But what are these layers? Let's investigate.

Single Point of Failure (SPF)

We start with a Single Point of Failure (SPF). If this component fails, the risk is realised and the impact is felt.

A failure of this single component makes the whole system fail.

👥 People

Someone executing a task.
Example: An employee is sick, and nobody else can do their tasks.

📃 Process

A process which is followed to execute a task.
Example: A process fails, and the expected result is not delivered.

⚙️ Technology

A technology component, for example, a server or a service, which executes a task.
Example: A server is faulty, and the service is not accessible anymore.

High Availability (HA)

To increase the availability of the system, we can implement High Availability (HA) by adding a secondary redundant element so that if the first one fails, the second can take over.

This is typically when two or more identical controls are in place so that if one fails, the other(s) can take over immediately.

👥 People

Two people are executing the same task.
Example: Two guards with the right skills guard an entrance. When one goes on break, the other can still guard it.

📃 Process

Two different processes exist and run in parallel.
Example: Quality control has two separate ways to verify that a given part meets the quality requirements, or a satellite runs two different algorithms to verify the same result.

⚙️ Technology

Two redundant hardware/software components fulfil the same role.
Example: Two servers are providing the same service so that if one fails, the other can take over.

Disaster Recovery (DR)

Disaster Recovery (DR) is another way to mitigate the risk of failure. Rather than always having two or more redundant elements available, as in the High Availability scenario above, we define a process that is activated when a disaster occurs and we can't recover with the primary system.

👥 People

A second person/team is available on-call.
Example: Someone falls ill during the day, and we call someone else in to take over.

📃 Process

A second procedure is ready if the main procedure fails.
Example: The traffic lights don't work anymore, so we dispatch a policeman to handle the traffic.

⚙️ Technology

Replacement hardware is ready if the primary fails.
Example: A data centre has an outage, and we start the servers in a secondary data centre.

Business Continuity Planning (BCP)

BCP is invoked when everything else has failed, to maintain operations despite the outage. This should be a rare occurrence, reserved for grave emergencies (earthquakes, general power failures, global pandemics…), as the service will usually be degraded.

👥 People

People have been trained on what to do in these scenarios.
Example: A guide in a museum falls ill during the day, and we don't have anyone to replace them, so we put up a sign directing people to a self-guided tour.

📃 Process

Adjusted processes are followed to accommodate the reduced availability of people and/or technology.
Example: The POS terminal doesn't work anymore, so we accept only cash transactions and document them by hand.

⚙️ Technology

Key business processes have been tested under typical crisis scenarios.
Example: Our telephone system has been hacked, and we can't use it anymore. Staff use their mobile phones in the meantime.

Let's dig a bit deeper

Understand Your Risks

What is your current risk profile? To answer this, you need to understand what your most valuable assets are. This will depend on your business:

Services you offer internally that run your operations.
Services you offer externally to your clients.
Data you hold, such as your intellectual property (IP), personally identifiable information (PII), and client data.
Health of your staff.

Once you understand your assets, you can calculate the Inherent Risk. This is the risk that a threat eventuates without any additional control in place: for example, a component failing, a process not being followed, or an entire data centre being unavailable.

Based on this information, you can then decide which controls to put in place to reduce this risk to an acceptable level.

Choose Your Defences

You don't have to have all layers of defence in place. Which ones you need will depend on:

Inherent Risk: If it's already at or below your risk threshold, then you might decide not to do anything.
Risk reduction: This considers how effectively a new control will reduce the risk. A control might not reduce the risk enough to justify its cost, the effort to implement it, or its operational impact.
Cost of controls: The cost can be measured in dollars, work hours to implement, or work hours to operate. A fully replicated, highly available service can cost a lot of money: we need to double the servers and storage, add advanced services to handle state, and so on. This cost must be weighed against the cost of an outage.

The greater the risk reduction and the lower the control's cost, the higher the value.

Control value = Risk reduction / Control cost

For example, would it be wise to implement a new control that costs $100k to implement and $50k/year to operate, when it would prevent an estimated $200k loss from a risk expected to occur once every ten years? Over ten years, the control would cost $600k, while the exposure would only be around $200k. So, based on these factors alone, it's not worth it.
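The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch: the variable names and the ten-year horizon are illustrative assumptions, and a real assessment would weigh more factors than these.

```python
def control_value(risk_reduction: float, control_cost: float) -> float:
    """Control value = risk reduction / control cost."""
    if control_cost <= 0:
        raise ValueError("control_cost must be positive")
    return risk_reduction / control_cost


# Figures from the example above; the names are illustrative assumptions.
horizon_years = 10
implementation_cost = 100_000
operating_cost_per_year = 50_000
loss_per_event = 200_000
years_between_events = 10  # "once every ten years"

# Total cost of owning the control over the horizon.
total_cost = implementation_cost + operating_cost_per_year * horizon_years

# Expected loss the control would prevent over the same horizon.
expected_events = horizon_years / years_between_events
exposure = loss_per_event * expected_events

value = control_value(exposure, total_cost)

print(f"Cost over {horizon_years} years: ${total_cost:,.0f}")    # $600,000
print(f"Exposure over {horizon_years} years: ${exposure:,.0f}")  # $200,000
print(f"Control value: {value:.2f}")                             # 0.33
```

A value below 1 means the control costs more than the loss it is expected to prevent over the chosen horizon.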

High Availability Strategies

There are two main scenarios for high-availability designs: Active-Active and Active-Passive. The right option depends on the situation.

Active-Active

Both components are active and share the work at the same time.

👍 Pros:

Increased performance
Reduced risk of data loss

👎 Cons:

Can be more complex
Can be more expensive
Not always practical
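To make the Active-Active idea concrete, here is a minimal Python sketch of a pool in which every node serves requests in turn and unhealthy nodes are skipped. The class name, node names, and health flags are hypothetical. A real setup would rely on a load balancer with health checks and would also need to keep state consistent across nodes, which is where much of the complexity comes from.

```python
from itertools import cycle


class ActiveActivePool:
    """Round-robin over nodes that are all active at the same time."""

    def __init__(self, names: list[str]) -> None:
        self.healthy = {name: True for name in names}
        self._order = cycle(names)
        self._size = len(names)

    def route(self, request: str) -> str:
        # Try each node at most once, skipping the ones marked unhealthy.
        for _ in range(self._size):
            name = next(self._order)
            if self.healthy[name]:
                return f"{name} served {request}"
        raise RuntimeError("no healthy node available")


pool = ActiveActivePool(["node-a", "node-b"])
print(pool.route("req-1"))  # node-a served req-1
print(pool.route("req-2"))  # node-b served req-2

pool.healthy["node-a"] = False  # simulate a failure of node-a
print(pool.route("req-3"))  # node-b served req-3
```

While both nodes are healthy, the load is shared between them. When one fails, the other keeps serving without any switchover step.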

Active-Passive

The second component is passive, waiting to jump in at any moment.

👍 Pros:

Possibly cheaper

👎 Cons:

Data loss is possible
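The Active-Passive pattern can be sketched in Python as follows. The `Node` and `ActivePassivePair` classes are hypothetical and only model the role swap. In practice, failover is usually driven by health checks or a cluster manager rather than by a failed request.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class ActivePassivePair:
    """One node serves all requests; the standby waits to take over."""

    def __init__(self, primary: Node, standby: Node) -> None:
        self.active = primary
        self.standby = standby

    def handle(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Fail over: promote the standby and retry the request once.
            # If the standby is also down, the error propagates.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


pair = ActivePassivePair(Node("primary"), Node("standby"))
print(pair.handle("req-1"))  # primary served req-1

pair.active.healthy = False  # simulate a failure of the primary
print(pair.handle("req-2"))  # standby served req-2
```

Anything the primary held that had not yet been copied to the standby is gone after the switch, which is why data loss is listed as a possible drawback.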

High Availability vs Disaster Recovery

High Availability usually entails two or more elements being available simultaneously, with an automatic failover between them.

Disaster Recovery is usually triggered manually. Parts of the Disaster Recovery process might be automated, for example, with scripts that start the DR servers and redirect the traffic to them, but there are often manual steps as well. A company is also likely to experience more data loss and a longer outage in a Disaster Recovery scenario than with HA.

High Availability

👍 Pros:

Automatic failover
Likely to be transparent to users
Less likely to incur data loss

👎 Cons:

More complex
More expensive
Not always possible

Disaster Recovery

👍 Pros:

Often cheaper than HA

👎 Cons:

Manual failover
Outage is likely
Some data loss is likely
