Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:
- Processes may fail
- There is no good way to tell that they have done so
Byzantine or arbitrary failures
In this case a server can send a message stating that fact φ is true to some servers, but to others it may reply that fact φ is false. Apart from that, the server might be able to forge messages from other servers, for example saying that according to server S1 fact φ is true when it is actually false according to S1.
In a Byzantine fault, a component such as a server can inconsistently appear both failed and functioning to failure-detection systems, presenting different symptoms to different observers. It is difficult for the other components to declare it failed and shut it out of the network, because they need to first reach a consensus regarding which component has failed in the first place.
Byzantine fault tolerance (BFT) is the resilience of a fault-tolerant computer system to such conditions. It is especially relevant to cryptocurrencies.
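To make the "different symptoms to different observers" behaviour concrete, here is a minimal sketch of equivocation. The names `byzantine_reply` and `detect_equivocation` are hypothetical, and the sketch assumes the observers can exchange their reports honestly, which a real Byzantine setting does not guarantee.

```python
def byzantine_reply(observer: str) -> bool:
    """Hypothetical faulty server: tells observer 'A' that phi is true,
    and tells everyone else that phi is false."""
    return observer == "A"


def detect_equivocation(reports: dict[str, bool]) -> bool:
    """Return True if the observers were told conflicting values for phi."""
    return len(set(reports.values())) > 1


observers = ["A", "B", "C"]
# Each observer records what the server told it about phi.
reports = {name: byzantine_reply(name) for name in observers}

print(reports)                       # {'A': True, 'B': False, 'C': False}
print(detect_equivocation(reports))  # True
```

Comparing notes only reveals that something is inconsistent. It does not by itself tell the observers which server lied, since a faulty observer could misreport what it heard. That is why the components first need to reach consensus on who has failed.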
Performance failures
This one is pretty simple to understand. The server delivers the correct values, but they arrive at the wrong time, either early or late.
Omission failures
This is a special case of a performance failure. The server is replying “infinitely late”, that is, its reply never arrives.
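The relationship between performance and omission failures can be sketched as a check against an acceptable time window. The function `classify_response` and its parameters `earliest` and `latest` are assumptions made for illustration. `None` stands for a reply that never arrives, which a real observer can only approximate with a timeout.

```python
from typing import Optional


def classify_response(arrival: Optional[float], earliest: float, latest: float) -> str:
    """Classify a reply by when it arrived relative to the window [earliest, latest]."""
    if arrival is None:
        # The reply is "infinitely late": an omission failure.
        return "omission"
    if arrival < earliest:
        return "performance (early)"
    if arrival > latest:
        return "performance (late)"
    return "ok"


print(classify_response(1.5, earliest=1.0, latest=2.0))   # ok
print(classify_response(0.2, earliest=1.0, latest=2.0))   # performance (early)
print(classify_response(7.0, earliest=1.0, latest=2.0))   # performance (late)
print(classify_response(None, earliest=1.0, latest=2.0))  # omission
```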
Crash failures
A crash failure occurs when a server suffers from an omission failure and then stops responding altogether.
Fail-stop failures
In this type of failure, the server only exhibits crash failures, but at the same time, we can assume that any correct server in the system can detect that this particular server has failed.
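In practice, the detection that fail-stop assumes is often approximated with heartbeats and timeouts. The sketch below uses a hypothetical `HeartbeatDetector` class with timestamps passed in explicitly so that the example is deterministic. Note that such a detector can only suspect a server: a slow server looks the same as a crashed one.

```python
class HeartbeatDetector:
    """Suspects a server once no heartbeat has been seen for `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, server: str, now: float) -> None:
        """Record that `server` was heard from at time `now`."""
        self.last_seen[server] = now

    def suspected(self, now: float) -> set[str]:
        """Return the servers whose last heartbeat is older than the timeout."""
        return {s for s, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("s1", now=0.0)
detector.heartbeat("s2", now=0.0)
detector.heartbeat("s1", now=2.0)   # s1 keeps reporting, s2 goes silent

print(detector.suspected(now=4.0))  # {'s2'}
```

Choosing the timeout is a trade-off: a short timeout detects crashes quickly but wrongly suspects slow servers, while a long one does the opposite.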
Things that you should ask yourself and know while implementing distributed systems
- How do you decide whether one event happened before another in the absence of any shared clock? The solution lies in Lamport clocks, their generalisation to vector clocks, and the Dynamo research.
- What is the impact of a single failure on the overall correctness of a distributed system?
- What are the different models of time (synchronous, partially synchronous and asynchronous)?
- Detecting failures is a fundamental problem of distributed computing.
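As an illustration of the first question, here is a minimal sketch of a Lamport clock and of the happened-before comparison on vector clocks. The class and function names are chosen for this example, and vector clocks are represented as plain lists with one counter per process.

```python
class LamportClock:
    """A logical clock: a counter that advances on every event."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event or a message send."""
        self.time += 1
        return self.time

    def receive(self, sent_time: int) -> int:
        """On receipt, jump past both the local time and the sender's timestamp."""
        self.time = max(self.time, sent_time) + 1
        return self.time


def happened_before(a: list[int], b: list[int]) -> bool:
    """Vector-clock order: a happened before b if a <= b in every
    component and the two vectors are not equal."""
    return all(x <= y for x, y in zip(a, b)) and a != b


p, q = LamportClock(), LamportClock()
sent = p.tick()              # p sends a message stamped 1
q.tick()                     # q performs a local event, its clock is now 1
received = q.receive(sent)   # max(1, 1) + 1 = 2
print(sent, received)        # 1 2

print(happened_before([1, 0], [1, 1]))  # True
print(happened_before([1, 0], [0, 1]))  # False
print(happened_before([0, 1], [1, 0]))  # False
```

The last two comparisons both return False: neither event happened before the other, so they are concurrent. Lamport timestamps alone cannot reveal this, which is what the generalisation to vector clocks adds.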