- GETTING STARTED
- BASIC JAVA
- JAVA STRINGS
- EXCEPTION HANDLING
- JAVA 9
- EFFECTIVE JAVA
There are a number of components in a Storm topology. The throughput (processing speed) of the topology is decided by the number of instances of each component running in parallel. This is known as the parallelism of a topology. To understand how parallelism works, we must first explain the main components involved in executing a topology in a Storm cluster:
Here is a simple illustration of their relationships:
Lets take word count topology as example. Storm will default most parallelism settings to a factor of one. Assuming we have one machine (node), have assigned one worker to the topology, and allowed Storm to one task per executor, our topology execution would look like the following:
The only parallelism here we have is at the thread level. Each task runs on a separate thread within a single JVM. How can we increase the parallelism to more effectively utilize the hardware we have at our disposal? Let's start by increasing the number of workers and executors assigned to run our topology.
Assigning additional workers is an easy way to add computational power to a topology, and Storm provides the means to do so through its API as well as pure configuration. To increase the number of workers assigned to a topology, we simply call the
setNumWorkers() method of the
Config config = new Config(); config.setNumWorkers(2);
This assigns two workers to our topology instead of the default of one. While this will add computation resources to our topology, in order to effectively utilize those resources, we will also want to adjust the number of executors in our topology as well as the number of tasks per executor.
By default, Storm creates a single task for each component defined in a topology and assigns a single executor for each task. Storm's parallelism API offers control over this behavior by allowing you to set the number of executors per task as well as the number of tasks per executor.
Let's modify our topology definition to parallelize
SentenceSpout such that it is assigned two tasks and each task is assigned its own executor thread:
builder.setSpout(SENTENCE_SPOUT_ID, spout, 2);
Now our topology looks like this with one worker..
Next, we will set up the split sentence bolt to execute as four tasks with two executors. Each executor thread will be assigned two tasks to execute (4 / 2 = 2). We'll also configure the word count bolt to run as four tasks, each with its own executor thread:
builder.setBolt(SPLIT_BOLT_ID, splitBolt, 2) .setNumTasks(4) .shuffleGrouping(SENTENCE_SPOUT_ID); builder.setBolt(COUNT_BOLT_ID, countBolt, 4) .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
With two workers, the execution of the topology will now look like the following diagram:
With the topology parallelism increased, running the updated WordCountTopology class should yield higher total counts for each word. Since spout emits data indefinitely and only stops when the topology is killed, the actual counts will vary depending on the speed of your computer and what other processes are running on it, but you should see an overall increase in the number of words emitted and processed.
It's important to point out that increasing the number of workers has no effect when running a topology in local mode. A topology running in local mode always runs in a single JVM process, so only task and executor parallelism settings have any effect. Storm's local mode offers a decent approximation of cluster behavior and is very useful for development, but you should always test your application in a true clustered environment before moving to production.
Note:Storm currently has the following order of precedence for configuration settings: defaults.yaml, storm.yaml, topology-specific configuration, internal component-specific configuration, external component-specific configuration.
one of the key features of Storm is that it allows us to modify the parallelism of a topology at runtime. The process of updating a topology parallelism at runtime is called rebalance. If we add new supervisor nodes to a Storm cluster and don't rebalance the topology, the new nodes will remain idle.
There are two ways to rebalance the topology:
The Storm Web UI will be covered in detail in the next chapter. Here is an example of using the CLI tool:
## Reconfigure the topology "mytopology" to use 5 worker processes, ## the spout "blue-spout" to use 3 executors and ## the bolt "yellow-bolt" to use 10 executors. $ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10