Distributed Systems

5 important failures to keep in mind while developing distributed systems


Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:

  • Processes may fail
  • There is no good way to tell that they have done so 

Byzantine or arbitrary failures

In this case a server can tell some servers that fact φ is true while telling others that φ is false. The server might also be able to forge messages from other servers, for example by claiming that according to server S1 fact φ is true when S1 actually holds it to be false.

In a Byzantine fault, a component such as a server can inconsistently appear both failed and functioning to failure-detection systems, presenting different symptoms to different observers. It is difficult for the other components to declare it failed and shut it out of the network, because they need to first reach a consensus regarding which component has failed in the first place.
Byzantine fault tolerance (BFT) is the ability of a fault-tolerant computer system to keep working correctly under such conditions. It is applied notably in cryptocurrency systems.
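
To make the "different answers to different observers" behaviour concrete, here is a minimal sketch of how correct servers might spot it by comparing notes. The function name find_equivocators and the (sender, receiver, value) report format are assumptions made for this illustration, and the sketch only works if reports cannot be forged (for instance because messages are signed). It is not a Byzantine fault tolerant protocol, just a way to see the symptom.

```python
from collections import defaultdict


def find_equivocators(reports):
    """Return the senders that told different receivers different values.

    `reports` is an iterable of (sender, receiver, value) tuples.
    Assumption: reports are authentic, i.e. nobody can forge a message
    in another server's name.
    """
    values_by_sender = defaultdict(set)
    for sender, _receiver, value in reports:
        values_by_sender[sender].add(value)
    # A sender that made two or more distinct claims is inconsistent.
    return {s for s, values in values_by_sender.items() if len(values) > 1}


# Hypothetical exchange: S4 tells S2 that the fact is true and S3 that it is false.
reports = [
    ("S1", "S2", True),
    ("S1", "S3", True),
    ("S4", "S2", True),
    ("S4", "S3", False),
]
equivocators = find_equivocators(reports)
```

Once forgery is possible, the reports themselves can no longer be trusted, which is why real Byzantine protocols need considerably more machinery than this.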

Performance failures

This one is pretty simple to understand. The server delivers the correct values, but they arrive at the wrong time, either too early or too late.

Omission failures

This is a special case of a performance failure: the server replies “infinitely late”, that is, the reply never arrives.
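
The two timing-related categories can be illustrated with a small classifier. This is only a sketch: the function name classify_response, the time window parameters and the convention that None means "no reply was ever observed" are all assumptions made for this example. In a real system an observer cannot wait forever, so "never arrived" really means "had not arrived when we gave up".

```python
def classify_response(sent_at, received_at, earliest, latest):
    """Classify a reply by its timing.

    sent_at / received_at are timestamps in seconds; received_at is None
    when no reply was observed (assumed convention for this sketch).
    earliest / latest bound the acceptable delay in seconds.
    """
    if received_at is None:
        return "omission"  # the reply is "infinitely late"
    delay = received_at - sent_at
    if delay < earliest:
        return "performance (early)"
    if delay > latest:
        return "performance (late)"
    return "ok"


# Acceptable delay window for this example: between 0.1 s and 2.0 s.
on_time = classify_response(10.0, 10.5, earliest=0.1, latest=2.0)
too_late = classify_response(10.0, 13.0, earliest=0.1, latest=2.0)
missing = classify_response(10.0, None, earliest=0.1, latest=2.0)
```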

Crash failures

A crash failure occurs when a server suffers an omission failure and from then on stops responding entirely.

Fail-stop failures

In this type of failure, the server only exhibits crash failures, but at the same time, we can assume that any correct server in the system can detect that this particular server has failed.
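
The fail-stop model assumes that detection is reliable. In practice, a common approximation is a heartbeat timeout, which can only suspect a server rather than prove that it has failed. The sketch below shows the idea; the class name HeartbeatDetector and its methods are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class HeartbeatDetector:
    """Suspects a server once no heartbeat has been seen for `timeout` seconds.

    Illustrative only: a slow server or a slow network looks exactly like
    a crashed server to this detector, so it can be wrong.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        """Record that `server` sent a heartbeat at time `now`."""
        self.last_seen[server] = now

    def suspected(self, now):
        """Return the servers whose last heartbeat is older than the timeout."""
        return {
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        }


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("S1", now=0.0)
detector.heartbeat("S2", now=3.0)
suspects_at_6 = detector.suspected(now=6.0)  # S1 has been silent for 6 s
detector.heartbeat("S1", now=7.0)            # S1 was merely slow
suspects_at_8 = detector.suspected(now=8.0)
```

Note that S1 is suspected and then cleared again, which is exactly the kind of mistake the fail-stop assumption rules out.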

Things that you should ask yourself and know while implementing distributed systems

  • How do you decide whether one event happened before another in the absence of any shared clock? The answer lies in Lamport clocks, their generalisation to vector clocks, and the research behind Dynamo.
  • What is the impact of a single failure on the overall correctness of a distributed system?
  • What are the different models of time (synchronous, partially synchronous and asynchronous)?
  • Detecting failures is a fundamental problem of distributed computing.
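
For the first question in the list above, a vector clock gives a way to order events without a shared clock. The following is a minimal sketch, assuming clocks are represented as dictionaries from process name to counter; the function names tick, merge and compare are chosen for this example and do not come from any particular library.

```python
def tick(clock, process):
    """Return a copy of `clock` with `process`'s counter incremented (local event)."""
    new_clock = dict(clock)
    new_clock[process] = new_clock.get(process, 0) + 1
    return new_clock


def merge(local, received, process):
    """Clock after `process` receives a message stamped with `received`."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, process)


def compare(a, b):
    """Return 'before', 'after', 'equal' or 'concurrent' for clocks a and b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


e1 = tick({}, "p1")        # p1 performs an event and sends a message
e2 = tick({}, "p2")        # p2 performs an unrelated event
e3 = merge(e2, e1, "p2")   # p2 receives p1's message
```

Here e1 happened before e3 because the message carries p1's history to p2, while e1 and e2 are concurrent: neither could have influenced the other.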

