Distributed Systems

How to detect failures in large-scale distributed systems


Failure Detectors: Use Cases

  • Maintaining a timely view of the current system status is essential to the performance and functionality of distributed systems, and failure detectors have long been the component that provides it.
  • The need for failure detectors arose early in the development of distributed systems to address the unreliability of asynchronous networks.
  • Recent industry trends favour huge fleets of commodity hardware on which failures are commonplace, so the need to respond and adjust to failures remains.
  • Dynamo’s failure detector relies on pinging and offers a weak, eventual completeness model based on randomisation.
  • The classic gossip protocol is instead based on heartbeats and offers a strong completeness model.
  • Failure detectors must also adapt to the demands of large, scalable systems in which failures are common.
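
To make the heartbeat-based gossip approach more concrete, here is a minimal sketch of how such a detector is commonly structured: every node keeps a table of heartbeat counters, bumps its own counter, and pushes the table to a random peer, which keeps the larger counter for each entry. The class name GossipNode, the t_fail parameter and the gossip_round helper are illustrative assumptions, not taken from any particular implementation.

```python
import random


class GossipNode:
    """One member of a gossip-style heartbeat group (illustrative only)."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail  # assumed: seconds without progress before suspecting
        # peer name -> (highest heartbeat counter seen, local time it was seen)
        self.table = {name: (0, 0.0)}

    def tick(self, now):
        """Advance our own heartbeat counter."""
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, now)

    def merge(self, remote_table, now):
        """Adopt any counter that is newer than the one we already hold."""
        for peer, (counter, _) in remote_table.items():
            known = self.table.get(peer)
            if known is None or counter > known[0]:
                self.table[peer] = (counter, now)

    def suspects(self, now):
        """Peers whose counter has not advanced within t_fail."""
        return {
            peer
            for peer, (_, seen) in self.table.items()
            if peer != self.name and now - seen > self.t_fail
        }


def gossip_round(nodes, alive, now, rng):
    """Each live node ticks, then pushes its table to one random member."""
    for name in sorted(alive):
        nodes[name].tick(now)
    for name in sorted(alive):
        others = [n for n in sorted(nodes) if n != name]
        if not others:
            continue
        target = rng.choice(others)
        if target in alive:  # a message sent to a crashed node is simply lost
            nodes[target].merge(nodes[name].table, now)
```

A stale table never refreshes an entry, because merge only accepts a strictly larger counter. A crashed node therefore ends up suspected by every node that had learned about it, which is where the strong completeness of this style of detector comes from.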


Two basic properties of failure detection

  • Completeness – There is a time after which every process that crashes is permanently suspected by some (weak completeness) or all (strong completeness) correct processes. A weak detector needs only a single correct process to suspect the failure, whereas a strong detector requires every correct process to suspect it. In short, crashes will eventually be learned.
  • Accuracy – There is a time after which some correct process is never suspected by any correct process (the eventual weak form of accuracy). In short, the detector must limit false suspicions of processes that are still alive.
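
The difference between weak and strong completeness is easier to see on a concrete snapshot. The sketch below checks both properties, plus false suspicions, for a single point in time. The function names and the suspected_by mapping (observer to the set of processes it suspects) are assumptions made for illustration. The real properties are eventual ("there is a time after which"), so a snapshot check like this only approximates them.

```python
def weakly_complete(crashed, correct, suspected_by):
    """Every crashed process is suspected by at least one correct process."""
    return all(
        any(p in suspected_by.get(q, set()) for q in correct)
        for p in crashed
    )


def strongly_complete(crashed, correct, suspected_by):
    """Every crashed process is suspected by every correct process."""
    return all(
        p in suspected_by.get(q, set())
        for p in crashed
        for q in correct
    )


def falsely_suspected(correct, suspected_by):
    """Correct processes that some correct process currently suspects."""
    return {
        p
        for q in correct
        for p in suspected_by.get(q, set())
        if p in correct
    }


# Hypothetical snapshot: "c" has crashed, only "a" has noticed so far.
crashed = {"c"}
correct = {"a", "b"}
suspected_by = {"a": {"c"}, "b": set()}

print(weakly_complete(crashed, correct, suspected_by))    # True
print(strongly_complete(crashed, correct, suspected_by))  # False
print(falsely_suspected(correct, suspected_by))           # set()
```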

Two ways of detecting liveness

  • Heartbeat – A node sends out a message at a fixed interval to tell others that it is still alive.
  • Pinging – A node asks other nodes whether they are alive; if a node replies in a timely manner, the asking node is satisfied that it is alive.
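
The two styles differ mainly in who initiates the message. A minimal sketch of both follows, assuming a fixed timeout. HeartbeatMonitor, ping and the send_probe callable are hypothetical names. send_probe stands in for whatever transport the system uses and is assumed to return a truthy value when a reply arrives within the timeout.

```python
import time


class HeartbeatMonitor:
    """Receiver side of a heartbeat scheme: suspect peers that go quiet."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # assumed: seconds of silence tolerated
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # peer -> time of the most recent heartbeat

    def on_heartbeat(self, peer):
        """Record that a heartbeat from peer has just arrived."""
        self.last_seen[peer] = self.clock()

    def suspected(self):
        """Peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


def ping(peer, send_probe, timeout):
    """Ask peer whether it is alive; no timely reply counts as not alive.

    send_probe(peer, timeout) is a hypothetical transport function.
    """
    try:
        return bool(send_probe(peer, timeout))
    except (TimeoutError, ConnectionError):
        return False
```

In both cases the timeout is the tuning knob: a short timeout detects crashes quickly but raises the risk of suspecting a node that is merely slow.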

Failure Detectors: Where to use what

  • Failure detectors maintain group membership, but the gossip and Dynamo protocols are aimed at quite different systems.
  • The gossip protocol has a short failure detection delay but is extremely sensitive to message loss.
  • Dynamo’s failure detector gives no guarantee on failure detection delay but handles message loss robustly. Its scalability and tolerance of uncertainty make it ideal for large systems that have a weak consistency model. As distributed systems continue to grow
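
Why a randomised pinging scheme cannot promise a detection delay can be shown with a little arithmetic. The sketch below assumes a simplified model that is not taken from the Dynamo design: n nodes, one of them crashed, and each of the n - 1 live nodes probing one uniformly random other node per round, independently. The function name miss_probability is hypothetical.

```python
def miss_probability(n, rounds):
    """Chance that nobody probes the crashed node for `rounds` rounds.

    Simplified model (an assumption): n nodes in total, one crashed, and
    each of the n - 1 live nodes probes one uniformly random other node
    per round, independently of all other probes.
    """
    if n < 2 or rounds < 0:
        raise ValueError("need n >= 2 and rounds >= 0")
    per_probe_miss = 1 - 1 / (n - 1)  # one probe picks someone else
    return per_probe_miss ** ((n - 1) * rounds)


print(miss_probability(3, 1))  # 0.25
```

Under this model the miss probability shrinks geometrically with the number of rounds but never reaches zero, so detection is eventual rather than bounded. For large n it is close to e^(-rounds), which means the figure barely changes as the system grows.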
