Introduction to Apache Storm

Developer(s): Backtype, Twitter
Written in: Clojure & Java
Operating system: Cross-platform
Type: Distributed stream processing
License: Apache License 2.0

The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.

However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem. Storm fills that hole. Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:

  • Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.
  • Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.
  • Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around, and reconfigure the other workers to send messages to the new locations. This introduces moving parts and new pieces that can fail.

Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate? Storm satisfies these goals.
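The queues-and-workers pattern described above can be sketched in plain Java (the queue names and the "enrich" step are illustrative, not from any real system). Note how much of the code is plumbing — wiring queues together and propagating shutdown signals — rather than the processing logic you actually care about:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueWorkerSketch {
    private static final String POISON = "POISON"; // manual shutdown signal

    static List<String> run(List<String> messages) throws InterruptedException {
        BlockingQueue<String> rawQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> enrichedQueue = new LinkedBlockingQueue<>();
        List<String> results = new CopyOnWriteArrayList<>();

        // Worker 1: take raw messages, "enrich" them, hand off to the next queue.
        Thread enricher = new Thread(() -> {
            try {
                String msg;
                while (!(msg = rawQueue.take()).equals(POISON)) {
                    enrichedQueue.put(msg.toUpperCase());
                }
                enrichedQueue.put(POISON); // forward the shutdown signal downstream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Worker 2: take enriched messages and "write them to the database".
        Thread writer = new Thread(() -> {
            try {
                String msg;
                while (!(msg = enrichedQueue.take()).equals(POISON)) {
                    results.add(msg);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        enricher.start();
        writer.start();
        for (String m : messages) rawQueue.put(m);
        rawQueue.put(POISON);
        enricher.join();
        writer.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("click", "view", "buy"))); // prints [CLICK, VIEW, BUY]
    }
}
```

If a worker dies, messages silently pile up in its queue; if throughput grows, you must hand-partition the queues — exactly the brittleness and scaling pain listed above.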

Real Time Data Processing Challenges

Real-time data processing challenges are very complex. Big Data is commonly characterized by the volume, velocity, and variety of the data, and Hadoop-like systems handle the volume and variety parts of it. Along with volume and variety, a real-time system needs to handle the velocity of the data as well. And handling the velocity of Big Data is not an easy task.

First, the system should be able to collect the data generated by real-time event streams coming in at rates of millions of events per second. Second, it needs to process this data in parallel as it is being collected. Third, it should perform event correlation, using a Complex Event Processing engine, to extract meaningful information from the moving stream. These three steps should happen in a fault-tolerant and distributed way. The real-time system should be a low-latency system, so that computation can happen very fast with near-real-time response capabilities.

Streaming data can be collected from various sources, processed in a stream processing engine, and the results then written to destination systems. In between, queues are used for storing and buffering the messages. Storm came into the picture to solve this complex real-time processing challenge.

What is Apache Storm?

Apache Storm is a free and open source distributed realtime computation system, fine-tuned to make it easy to reliably process unbounded streams of data. It is a big data processing system similar to Hadoop in its basic technology architecture, but tuned for a different set of use cases.

Apache Storm

Hadoop, the clear king of big-data analytics, is focused on batch processing. This model is sufficient for many cases, but other use models exist in which real-time information from highly dynamic sources is required. Solving this problem resulted in the introduction of Storm from Nathan Marz. Storm operates not on static data but on streaming data that is expected to be continuous. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.

Storm Use Cases

Storm has many use cases:

  • realtime analytics
  • online machine learning
  • continuous computation
  • distributed RPC
  • ETL, and more

Typical Use Cases:

  • Telecom:

    With Storm, telecom providers gain access to real-time analysis of their data, which makes a big difference to their business; for this reason, many telecom providers are adopting Apache Storm. They also use Storm for customer service, bandwidth allocation, and similar tasks.

  • Finance:

    Real-time analytics are imperative for banks, and Storm fits the requirement perfectly. Storm is used for security monitoring, fraudulent-transaction detection, and similar tasks.

  • Retail:

    Dynamic pricing is the backbone of retail, and Storm is used to drive it. It is also used for monitoring, analytics, and ad targeting.

  • Social Networking:

    Twitter analyzes trending topics from the stream of tweets, making it an excellent example of Storm's real-time use case.

  • Web:

    Many web technology companies process customer behaviour on web pages and mobile apps, triggering real-time actions according to the scenarios defined by the site or app owner.

  • Others:

    Storm is used for server event log monitoring and auditing systems.
    Advertising companies use Storm to calculate priorities in real time, to decide which ads to show for a given website, visitor, and country.
    Music streaming services use it for tracking and analyzing application events, as well as for recommendations and parallel task execution.

Apache Storm Design Goals

Five characteristics make Storm ideal for real-time data processing workloads:

  • Scalable: The operations team needs to easily add or remove nodes from the Storm cluster without disrupting existing data flows through Storm topologies.
  • Resilient: Fault-tolerance is crucial to Storm as it is often deployed on large clusters, and hardware components can fail. The Storm cluster must continue processing existing topologies with a minimal performance impact.
  • Extensible: Storm topologies may call arbitrary external functions (e.g., looking up a MySQL service for the social graph), and thus need a framework that allows extensibility.
  • Efficient: Since Storm is used in real-time applications, it must have good performance characteristics. Storm uses a number of techniques, including keeping all of its storage and computational data structures in memory.
  • Easy to Administer: Since Storm is at the heart of user interactions on Twitter, end-users immediately notice failures or performance issues associated with Storm. The operations team needs early-warning tools and must be able to quickly pinpoint the source of problems as they arise. Thus, easy-to-use administration tools are not a "nice to have" feature, but a critical part of the requirement.

How does Storm differ from Hadoop?

The simple answer is that Storm analyzes realtime data while Hadoop analyzes offline data. In practice, the two frameworks complement one another more than they compete.

Hadoop is fundamentally a batch processing system. Data is introduced into the Hadoop file system (HDFS) and distributed across nodes for processing. When the processing is complete, the resulting data is returned to HDFS for use by the originator. Storm supports the construction of topologies that transform unterminated streams of data. Those transformations, unlike Hadoop jobs, never stop, instead continuing to process data as it arrives.

Hadoop provides its own file system (HDFS) and manages both data and code/tasks. It divides data into blocks, and when a job (a MapReduce job) executes, it pushes the analysis code close to the data it is analyzing. Basically, Hadoop partitions data into chunks and passes those chunks to mappers that map keys to values. Reducers then assemble those mapped key/value pairs into a usable output. The MapReduce paradigm operates quite elegantly but is targeted at data analysis. To leverage the full power of Hadoop, application data must be stored in the HDFS file system.
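The map-then-reduce flow described above can be illustrated with a plain-Java word count. This is a conceptual sketch, not Hadoop's actual API: the method names and the hand-rolled "chunking" are simplifications standing in for HDFS blocks and the MapReduce framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceSketch {
    // "Map" phase: split one chunk of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        return Arrays.stream(chunk.toLowerCase().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Reduce" phase: sum the counts for each distinct key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two "chunks", as if the input file were split into blocks across nodes.
        List<String> chunks = List.of("storm hadoop storm", "hadoop batch");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String chunk : chunks) mapped.addAll(map(chunk)); // mappers run per chunk
        System.out.println(reduce(mapped)); // prints {batch=1, hadoop=2, storm=2}
    }
}
```

The key point for the comparison with Storm: no output exists until every chunk has been mapped and the reduce step has run over the complete input — the job has a beginning and an end.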

Storm is interested in understanding things that are happening in realtime — meaning right now — and interpreting them. Storm does not have its own file system, and its programming paradigm is quite a bit different from Hadoop's. Storm is all about obtaining streams of data from sources, known as spouts, and passing that data through various processing components, known as bolts. Storm's data processing mechanism is extremely fast and is meant to help you identify live trends as they are happening.
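The spout-and-bolt flow can be sketched without the Storm API, as a hedged illustration of the data-flow idea: a finite list stands in for a spout's unbounded stream, and a "bolt" method emits an updated running count after every tuple, instead of waiting for a batch job to finish.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamSketch {
    // "Bolt" logic: maintain a running count and emit the updated count per tuple.
    static List<String> runningCounts(List<String> tuples) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (String tuple : tuples) {
            counts.merge(tuple, 1, Integer::sum);
            // In a real topology this emit would feed downstream bolts; the point
            // is that results are available after every tuple, not only at job end.
            emitted.add(tuple + "=" + counts.get(tuple));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // A finite list stands in for a spout's unbounded stream of events.
        List<String> spout = List.of("ad", "click", "ad", "view", "ad");
        System.out.println(runningCounts(spout)); // prints [ad=1, click=1, ad=2, view=1, ad=3]
    }
}
```

Contrast this with the batch word count above: here the counts are continuously up to date as events arrive, which is what makes live trend detection possible.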

But in a big data environment, some use cases have shown that Storm and Hadoop can work beautifully together. For example, in telecom, live data can be analyzed using Storm and then stored in HDFS for further analysis and report generation. Similarly, an advertising company might use Storm to dynamically adjust its advertising engine in response to current user behavior, then use Hadoop to identify the long-term patterns in that behavior.