On a project for a client, we faced a problem where our application wasn’t scaling sufficiently. At the end of this post, I’ll let you know which framework we chose, and why. We’ll focus less on comparing the frameworks and more on what I had to learn just to participate in the evaluation necessary to select a streaming framework. While I do provide brief summaries of several frameworks, there are many blog posts, articles and websites describing streaming frameworks and offering in-depth comparisons.
As mentioned previously, we faced a problem where our application wasn’t scaling sufficiently. The bottleneck was Elasticsearch. Another application was ingesting data from networked physical devices, placing transformed data into Elasticsearch, and then sending our application a message via Kafka. Once we received the message, we retrieved the data from Elasticsearch, performed operations on it, and persisted the results back to Elasticsearch.
As the application grew, performance times became untenable. We first turned to the low-hanging fruit offered by caching. While we greatly improved our performance, it still wasn’t enough. We then used a stream processing framework to tap into the existing data flow, saving us the pain of multiple reads from Elasticsearch.
First, some common terms you’ll need to know when discussing streaming frameworks:
- message/record/event/element: a piece of data
- collections: datasets, i.e., a defined set of records
- pipelines: a series of operations performed on data
- transformations: a data processing step in a pipeline
- data sources: an I/O location from which data is read, often the beginning of a pipeline
- data sinks: an I/O location to which data is written, often the end of a pipeline
- Hadoop Distributed File System (HDFS): a distributed Java-based file system for storing large volumes of data
- Hadoop: a distributed data infrastructure using the HDFS
- MapReduce: process large data sets in parallel using map operations (e.g., filtering, sorting) and reduction operations (e.g., aggregation, transformation) on the data. MapReduce is a general term, but can also specifically represent the processing component available to Hadoop.
How does the streaming framework manipulate the data stream?
- Batch Processing: Can the framework perform one or more operations on a collection of data, i.e., on bounded data?
- Stream Processing: Can the framework act continuously on single records as they arrive, i.e., on unbounded data?
- Micro-batch Processing: Some frameworks act on small groups of records, smaller in size than a batch, but larger than an individual record.
- Aggregation: Does the framework support the grouping of data, such as by averaging, counting, summing, etc.?
- Interactive Queries: Within the pipeline, can we retrieve data from a datastore?
- Windowing: Does the framework allow us to do something over regular time intervals of data, e.g., every hour, calculate a rolling average of the last 24 hours worth of data?
How do messages flow through a pipeline?
- Guarantees: How often will a message be delivered or an operation performed (e.g., exactly once, at least once, etc.)? Can events be replayed if necessary?
- Handling out-of-order events: What does the framework do when events do not arrive in chronological or sequential order?
- Delayed events: Is it problematic if there is a delay between when an event is generated and when it is ingested into a pipeline?
- Joining event streams: Can parallel streams be joined?
- Sorting and filtering events: How does the framework filter events? Can the framework sort collections?
Resilience & Performance
- What happens when the volume of messages increases?
- Scalability: How does the framework handle a growing amount of work?
- Infrastructure compatibility: How well does the framework mesh with our current technologies?
- Fault tolerance: How the does the framework handle a cluster failure, an application restart or bad data?
- Latency: How much time transpires between entering the pipeline and exiting the pipeline?
- Throughput: How many records can be processed in a given period of time?
How to integrate a streaming framework into your application.
- Reusability of functionality: Is existing work easy to leverage? Is the work composable?
- Testability: How easy is it to test implementations of the framework? How long do tests take to write, and how long to run?
- Learning curve: How long does it take to develop a minimum proficiency with the technology? How much more effort does it take to become an expert in the technology? What documentation and learning resources are available?
- Community support: How large and active is the technology’s community? Is the technology backed by a company?
A Word About Mesos and YARN
Apache Mesos vs Apache YARN: I bring this up because some streaming frameworks rely on Mesos, while others rely on YARN. Both Mesos and YARN are resource management tools for distributed systems (physical or virtual machines). All available machines become a single pool of resources available to run jobs.
In Mesos, a client submits a job to a Mesos master, which then queries Mesos agents on each physical or virtual machine to determine which resources are available. Each agent will respond to the master, offering available resources. The master will then accept the most appropriate offer and will place the job with that agent.
YARN is more monolithic. It knows about the resources available on its nodes and doesn’t need to query them. A job’s ApplicationMaster will request a resource from YARN’s global ResourceManager, which will then place the job with an appropriate node.
Mesos has better scaling capabilities, while YARN has better scheduling capabilities.
Some Streaming Frameworks in Brief
Apache Beam is “a unified API that allows you to express complex data processing workflows“, running streaming pipelines on Apache Flink and Apache Spark (when running locally and in a non-Google cloud), or on Google Cloud Dataflow (when running on the Google Cloud Platform). The project is in incubation at Apache, with the code base having been donated by Google.
Beam aims to become “the JDBC for streaming”, meaning it is to provide a common abstraction for talking about streams and interaction with streams, regardless of the underlying streaming implementation (Flink, Spark or Dataflow). It is available as a Java SDK, but a Python SDK is under development.
Apache Flink is a parallel and distributed stream processing framework with decreased latency (as it is a true streaming technology), intended as an alternative to Hadoop’s MapReduce. It provides fault tolerance, query optimization, automatic caching, automatic data indexing and more complex analytical tools such as joins, unions, and iterations. Flink runs on YARN or Mesos and can read/write from/to HDFS or other data storage systems.
Flink guarantees exactly-once delivery (through a checkpointing mechanism), supports windowing, and can compute over streams where events are delayed or arrive out of order, and can maintain state. Iterative computations allow for machine learning and graph analysis.
Batch processing was added later, on top of the streaming engine. Apache and Google continue to develop Flink with an eye towards making the Beam and Flink APIs compatible. Programs may be written in Java or Scala, and there are a growing number of libraries.
Flink programs look like regular programs that transform collections of data. Each program consists of the same basic parts:
- Obtain an execution environment
- Load/create the initial data
- Specify transformations on this data
- Specify where to put the results of your computations
- Trigger the program execution
The May release of Kafka 0.10 included a new component: Kafka Streams. It is a library, not a framework, so it is lightweight and deployment-agnostic. You simply include it in your Java application, and deploy/run it however you deploy/run that application.
Kafka Streams offers processing of unordered events, replay of events, several joining and aggregation operators, data source connectors to datastores, high scalability, fault tolerance and general simplicity. It currently offers at-least-once delivery, but Apache expects to have exactly-once delivery soon.
Windowing operations are available where users can specify a retention period for the window based on event time (defined by whatever creates the event), ingestion time (when the event is stored into Kafka), and processing time (when the event is processed). This allows Kafka Streams to retain old window buckets for a period of time in order to wait for the late arrival of records whose timestamps fall within the window interval.
Kafka Streams operates on unbounded streams only and thus offers no batching. It does provide the concept of KTables, which is essentially a view of a stream (called a KStream) aggregated into a table over a period of time. It is possible to inner/outer/left join two KStreams, a KStream to a KTable or two KTables.
Apache Spark (2.0)
Apache Spark is an in-memory batch engine that executes streaming jobs as a series of mini-batches.. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. The initial impetus for developing Apache Spark was to run iterative training algorithms for machine learning. It is thought to be the successor to the MapReduce component of Hadoop.
The API is centered on a data structure called the resilient distributed dataset (RDD). RDDs are immutable collections of elements, and fault-tolerance is achieved by keeping track of the sequence of operations that produced them. RDDs can be created from Hadoop InputFormats (such as HDFS, Cassandra, Amazon S3, etc.) or by transforming other RDDs. These datasets have actions, which return values, and transformations, which return pointers to new RDDs. Event delivery is exactly once and windowing operations are possible. The framework is fault-tolerant through checkpointing. Streams can be joined with other streams and other datasets.
Spark Streaming originally ingested data in micro-batches. This allows code written for batch analytics to be used for streaming analytics. The downside is increased latency, while records are collected into the micro-batch. However, this type of processing is good for iterative algorithm, which visit the dataset multiple times in a loop. Now Spark also offers processing of unbounded data.
For cluster management, Spark supports a standalone cluster, YARN, or Mesos. Applications can be written in Java, Python, Scala. Or, R. Spark includes a machine learning library and a graph processing framework.
Google Cloud Dataflow
The Dataflow SDK offers the typical functionality of a streaming framework. PCollection represents all data, whether a fixed size batch or unbounded stream. PCollections are the inputs and outputs for each step in your pipeline. PTransforms are a library of parallel transformations that alter these collections at each step. Pipelines can be branched and joined, and windowing is available. Delivery is guaranteed to be exactly once.
What sets Dataflow apart is that it is run in the cloud and is offered as a managed service. Resource management and performance optimization are automatic. Applications are written in Java or Python and when run, users are billed per minute, on a per job basis, for CPU, memory, and local storage. Amazon offers a similar cloud-based streaming service product called Amazon Kinesis.
LinkedIn built Samza as a replacement for Hadoop and it became an incubating project at Apache in September 2013. It operates on unbounded data, using Apache Kafka for at-least-once messaging, and YARN or Mesos for distributed fault tolerance and resource management. Jobs consume and process data, and all messages are sent between jobs via Kafka, so jobs can process in isolation without losing their output when downstream jobs are unavailable.
Jobs are broken down further into tasks, and each task has its own datastore to maintain state locally. This datastore is replicated on other machines in the cluster to allow restoration of processing if necessary. Applications are written in Java and Scala. Windowing operations are available.
Twitter Heron and Apache Storm
Twitter created Heron as a replacement for Apache Storm, which wasn’t offering them the scaling and throughput they needed. They didn’t turn to another existing framework, because they would have had to rewrite all their code. Instead, they created a new framework with an API compatible with Storm.
Heron runs on Apache Aurora, a service scheduler that runs on top of Mesos, in a clustered environment. Applications are written in Java or Scala, though interestingly, Heron itself is written in C++ to allow for faster garbage collection.
Directed acyclic graphs called topologies are used to process streams of data. The individual pieces of data are called tuples (and they implement Java’s Tuple interface). The source of tuples is called a spout. Spouts send the tuples to bolts, which act on the data. A topology manager runs the topology by starting several containers.
Heron provides support for both at-most-once and at-least-once processing semantics.
I would have chosen Flink (perhaps run under Beam), as it seemed to offer everything we were looking for. The team, however, ruled it out. As a distributed framework, deployment and operation would have been different from our current process and would have put unacceptable strain on our DevOps team. Flink also does not yet support the versions we use of Gradle, Mesos, Akka or Kafka. We also run inside Docker containers, so there would be no auto-discovery of nodes.
The team then ruled out Spark 2.0 because the technology continues to be in flux. Upcoming functionality looks promising, but won’t be backwards compatible. Thus, anything we would write now may have to be rewritten after future Spark releases. Spark does have better support for R and machine learning, so we would reconsider Spark if those become more important to our project.
We didn’t evaluate any other framework in depth, except for the one we chose: Kafka Streams. It provides the functionality we need, fits in well with our existing technology stack (which includes Kafka), is lightweight, and is backed by Confluent, with whom the client has a service agreement. We can use third party software in the same manner and can use our current deployment process.
Note: this post uses some research performed by members of my team. Since I cannot name the client, I cannot give specific thanks and credit to the guys on the team who contributed to this post. But thanks guys!