Distributed Systems

How does facebook manage massive amount of data using Cassandra NoSQL database


What is Cassandra

  • Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.
  • Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers).
  • Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

The need for Cassandra

  • Facebook runs the largest social networking platform that serves hundreds of millions users at peak times using tens of thousands of servers located in many data centers around the world.
  • Cassandra was designed to fulfill the storage needs of the Inbox Search problem.
  • Inbox Search is a feature that enables users to search through their Facebook Inbox. At Facebook this meant the system was required to handle a very high write throughput, billions of writes per day, and also scale with the number of users. Since users are served from data centers that are geographically distributed, being able to replicate data across data centers was key to keep search latencies down.

Cassandra Data Model and system architecture

  • A table in Cassandra is a distributed multi dimensional map indexed by a key. The value is an object which is highly structured.
  • Every operation under a single row key is atomic per replica no matter how many columns are being read or written into.
  • Columns are grouped together into sets called column families very much similar to what happens in the Bigtable system.
  • The Cassandra API consists of the following three simple methods. insert(table, key, rowMutation) get(table, key, columnName) delete(table, key, columnName).
  • Typically a read/write request for a key gets routed to any node in the Cassandra cluster. The node then determines the replicas for this particular key.
  • For writes, the system routes the requests to the replicas and waits for a quorum of replicas to acknowledge the completion of the writes.
  • For reads, based on the consistency guarantees required by the client, the system either routes the requests to the closest replica or routes the requests to all replicas and waits for a quorum of responses.
  • Cassandra partitions data across the cluster using consistent hashing but uses an order preserving hash function to do so.
  • Cassandra uses replication to achieve high availability and durability. Each data item is replicated at N hosts, where N is the replication factor configured “per-instance”.
  • Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy Gossip based mechanism.

For using Cassandra in your software development, please check out


How useful was this post?

Click on a star to rate it!

Average rating 0 / 5. Vote count: 0

No votes so far! Be the first to rate this post.

As you found this post useful...

Follow us on social media!

We are sorry that this post was not useful for you!

Let us improve this post!

Tell us how we can improve this post?