A YARN application implements a specific function that runs on Hadoop; MapReduce is an example of a YARN application. A YARN application involves three components: the client, the ApplicationMaster (AM), and the container.
Launching a new YARN application starts with a YARN client communicating with the ResourceManager to create a new YARN ApplicationMaster instance. Part of this process involves the YARN client informing the ResourceManager of the ApplicationMaster's physical resource requirements.
The ApplicationMaster is the master process of a YARN application. It doesn't perform any application-specific work; those functions are delegated to the containers. Instead, it's responsible for managing the application-specific containers: notifying the ResourceManager of its intent to create containers and then liaising with the NodeManager to actually perform the container creation.
As part of this process, the ApplicationMaster must specify the resources that each container requires: which host should launch the container, and the container's memory and CPU requirements. The ResourceManager's ability to schedule work based on exact resource requirements is key to YARN's flexibility, and it enables hosts to run a mix of containers.
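As a sketch of this idea, a container request can be modeled as a (host, memory, vcores) tuple that the scheduler matches against each node's spare capacity. The class and field names below are illustrative only; the real YARN API expresses this through AMRMClient requests and Resource objects.

```java
import java.util.ArrayList;
import java.util.List;

public class ContainerRequestDemo {
    // A request names the preferred host plus exact memory/CPU needs.
    record ContainerRequest(String host, int memoryMb, int vcores) {}

    // Nodes with differing spare capacity can each satisfy different asks,
    // which is what lets one cluster run a mix of container sizes.
    record NodeCapacity(String host, int freeMemoryMb, int freeVcores) {
        boolean canFit(ContainerRequest r) {
            return freeMemoryMb >= r.memoryMb() && freeVcores >= r.vcores();
        }
    }

    // Place each request on its preferred host if that host has room.
    static List<String> schedule(List<ContainerRequest> asks, List<NodeCapacity> nodes) {
        List<String> placements = new ArrayList<>();
        for (ContainerRequest ask : asks) {
            for (NodeCapacity node : nodes) {
                if (node.host().equals(ask.host()) && node.canFit(ask)) {
                    placements.add(ask.host() + ":" + ask.memoryMb() + "MB/" + ask.vcores() + "vcore");
                    break;
                }
            }
        }
        return placements;
    }

    public static void main(String[] args) {
        List<ContainerRequest> asks = List.of(
            new ContainerRequest("node1", 2048, 2),  // a large container
            new ContainerRequest("node2", 512, 1));  // a small one on another host
        List<NodeCapacity> nodes = List.of(
            new NodeCapacity("node1", 4096, 4),
            new NodeCapacity("node2", 1024, 1));
        System.out.println(schedule(asks, nodes));
    }
}
```

The point of the sketch is that requests carry exact, per-container requirements, so containers of very different sizes can coexist on the same hosts.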
The ApplicationMaster is also responsible for the specific fault-tolerance behavior of the application. It receives status messages from the ResourceManager when its containers fail, and it can decide to take action based on these events (by asking the ResourceManager to create a new container), or to ignore these events.
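A minimal sketch of such an application-specific policy, assuming a simple retry-up-to-N-attempts rule (the method and class names are illustrative; the real YARN API reports failures to the AM via ContainerStatus objects):

```java
import java.util.HashMap;
import java.util.Map;

public class RetryPolicyDemo {
    static final int MAX_ATTEMPTS = 3;

    // Attempts made per logical task, keyed by task id.
    private final Map<String, Integer> attempts = new HashMap<>();

    // Called when the ResourceManager reports a failed container.
    // Returns true if the AM should request a replacement container,
    // false if it gives up on (or chooses to ignore) the task.
    boolean onContainerFailed(String taskId) {
        int n = attempts.merge(taskId, 1, Integer::sum);
        return n < MAX_ATTEMPTS;
    }

    public static void main(String[] args) {
        RetryPolicyDemo am = new RetryPolicyDemo();
        System.out.println(am.onContainerFailed("task-0")); // true  (attempt 1, retry)
        System.out.println(am.onContainerFailed("task-0")); // true  (attempt 2, retry)
        System.out.println(am.onContainerFailed("task-0")); // false (give up)
    }
}
```

Because this logic lives in the ApplicationMaster rather than in YARN itself, each application framework is free to choose its own policy, from aggressive retries to ignoring failures entirely.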
With these concepts in mind, let's see how applications conceptually work in YARN. Application execution consists of the following steps (as illustrated in the diagram):

1. A client submits the application to the ResourceManager, including the specification needed to launch the ApplicationMaster.
2. The ResourceManager selects a NodeManager and has it launch the ApplicationMaster in a container.
3. The ApplicationMaster registers with the ResourceManager and requests the containers it needs, specifying each container's host, memory, and CPU requirements.
4. The ResourceManager grants containers as resources become available, and the ApplicationMaster liaises with the NodeManagers to launch them and run the application's work.
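In outline: a client submits the application to the ResourceManager, the ResourceManager launches the ApplicationMaster in a container, and the ApplicationMaster then negotiates worker containers. The toy walkthrough below models that handshake in plain Java; the class names are illustrative only (the real interaction goes through YarnClient, ResourceManager RPC, and the NodeManagers).

```java
import java.util.ArrayList;
import java.util.List;

public class YarnFlowDemo {
    static List<String> log = new ArrayList<>();

    static class ResourceManager {
        // Steps 1-2: client submits; RM picks a node and launches the AM container.
        ApplicationMaster submitApplication(String appName) {
            log.add("RM: accepted " + appName + ", launching AM container");
            return new ApplicationMaster(this);
        }
        // Step 3: AM asks the RM for worker containers.
        int allocate(int numContainers) {
            log.add("RM: granted " + numContainers + " containers");
            return numContainers;
        }
    }

    static class ApplicationMaster {
        private final ResourceManager rm;
        ApplicationMaster(ResourceManager rm) { this.rm = rm; }

        // Step 4: AM launches the granted containers via the NodeManagers.
        void run(int containersNeeded) {
            int granted = rm.allocate(containersNeeded);
            for (int i = 0; i < granted; i++) {
                log.add("NM: started container " + i);
            }
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();       // step 1: client -> RM
        ApplicationMaster am = rm.submitApplication("demo-app");
        am.run(2);                                        // steps 3-4
        log.forEach(System.out::println);
    }
}
```

Note how the client only ever talks to the ResourceManager; once the ApplicationMaster is running, all further container negotiation happens between the AM, the RM, and the NodeManagers.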
The lifespan of a YARN application can vary dramatically: from a short-lived application of a few seconds to a long-running application that runs for days or even months. Rather than look at how long the application runs for, it’s useful to categorize applications in terms of how they map to the jobs that users run. The simplest case is one application per user job, which is the approach that MapReduce takes.
The second model is to run one application per workflow or user session of (possibly unrelated) jobs. This approach can be more efficient than the first, since containers can be reused between jobs, and there is also the potential to cache intermediate data between jobs. Spark is an example that uses this model.
The third model is a long-running application that is shared by different users. Such an application often acts in some kind of coordination role. For example, Apache Slider has a long-running application master for launching other applications on the cluster. This approach is also used by Impala (see SQL-on-Hadoop Alternatives) to provide a proxy application that the Impala daemons communicate with to request cluster resources. The “always on” application master means that users have very low-latency responses to their queries since the overhead of starting a new application master is avoided.