Distributed Systems

MillWheel – a highly scalable and fault tolerant stream processing system


What is MillWheel

  • MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Think of it as a programming model, tailored specifically to streaming, low-latency systems.
  • Users write application logic as individual nodes in a directed compute graph, for which they can define an arbitrary, dynamic topology. Records are delivered continuously along edges in the graph. MillWheel provides fault tolerance at the framework level, where any node or any edge in the topology can fail at any time without affecting the correctness of the result.
  • Every record in the system is guaranteed to be delivered to its consumers. Furthermore, the API that MillWheel provides for record processing handles each record in an idempotent fashion, making record delivery occur exactly once from the user’s perspective. MillWheel checkpoints its progress at fine granularity, eliminating any need to buffer pending data at external senders for long periods between checkpoints.
  • You can create complex streaming systems without distributed systems expertise. And the systems built will be fault tolerant and scalable.

Requirements that a Stream Processing System Millwheel satisfies

  • Data should be available to consumers as soon as it is published (i.e. there are no system-intrinsic barriers to ingesting inputs and providing output data).
  • Out-of-order data should be handled gracefully by the system.
  • A monotonically increasing low watermark of data timestamps should be computed by the system.
  • Latency should stay constant as the system scales to more machines.
  • The system should provide exactly-once delivery of records.
MillWheel data flow

Overview of Millwheel system design

  • MillWheel is a graph of user-defined transformations on input data that produces output data.
  • Abstractly, inputs and outputs in MillWheel are represented by (key, value, timestamp) triples. While the key is a metadata field with semantic meaning in the system, the value can be an arbitrary byte string, corresponding to the entire record.
  • Collectively, a pipeline of user computations will form a data flow graph, as outputs from one computation become inputs for another, and so on.
  • MillWheel makes record processing idempotent with regard to the framework API.
  • All internal updates within the MillWheel framework resulting from record processing are atomically checkpointed per-key and records are delivered exactly once.

Real World Uses of Millwheel

  • Millwheel performs streaming joins for a variety of Ads customers, many of whom require low latency updates to customer-visible dashboards.
  • Google’s Billing system pipelines depend on MillWheel’s exactly-once guarantees.
  • Beyond Zeitgeist, MillWheel powers a generalized anomaly-detection service that is used as a turnkey solution by many different teams.
  • Millwheel is used in network switch and cluster health monitoring systems.
  • MillWheel also powers user-facing tools like image panorama generation and image processing for Google Street View

How useful was this post?

Click on a star to rate it!

Average rating 0 / 5. Vote count: 0

No votes so far! Be the first to rate this post.

As you found this post useful...

Follow us on social media!

We are sorry that this post was not useful for you!

Let us improve this post!

Tell us how we can improve this post?