Apache YARN Application Run


A YARN application implements a specific function that runs on Hadoop. MapReduce is an example of a YARN application. A YARN application involves three components: the client, the ApplicationMaster (AM), and the container.

[Figure: Hadoop YARN application components]

Launching a new YARN application starts with a YARN client communicating with the ResourceManager to create a new YARN ApplicationMaster instance. Part of this process involves the YARN client informing the ResourceManager of the ApplicationMaster's physical resource requirements.
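
To make this step concrete, here is a minimal client-side sketch using Hadoop's YarnClient API. The application name, the 512 MB/1-vcore sizing, and the com.example.MyApplicationMaster launch command are illustrative assumptions, not part of any real application:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitYarnApp {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Ask the ResourceManager for a new application ID.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("my-yarn-app");

            // Tell the ResourceManager what the AM container itself needs.
            appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore

            // The command that starts the (hypothetical) ApplicationMaster class.
            String amCommand = "$JAVA_HOME/bin/java com.example.MyApplicationMaster"
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),               // local resources (jars, files)
                    Collections.emptyMap(),               // environment variables
                    Collections.singletonList(amCommand), // launch command
                    null, null, null);                    // service data, tokens, ACLs
            appContext.setAMContainerSpec(amContainer);

            // Submit: the ResourceManager finds a node and starts the AM there.
            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted application " + appId);
        }
    }

From this point on, the ResourceManager owns the job of finding a node for the ApplicationMaster and starting it there.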

The ApplicationMaster is the master process of a YARN application. It doesn't perform any application-specific work itself; those functions are delegated to the containers. Instead, it is responsible for managing the application-specific containers: informing the ResourceManager of its intent to create containers and then liaising with the NodeManager to actually perform the container creation.

As part of this process, the ApplicationMaster must specify the resources that each container requires in terms of which host should launch the container and what the container’s memory and CPU requirements are. The ability of the ResourceManager to schedule work based on exact resource requirements is a key to YARN’s flexibility, and it enables hosts to run a mix of containers.
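
The sketch below shows how an ApplicationMaster might express such a request using Hadoop's AMRMClient API. The 1 GB/1-vcore capability and the worker-node-01.example.com host preference are hypothetical values chosen for illustration:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AmContainerRequest {
        public static AMRMClient<ContainerRequest> requestWorker() throws Exception {
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new YarnConfiguration());
            rmClient.start();

            // Register so the ResourceManager knows this AM is alive and reachable.
            rmClient.registerApplicationMaster("", 0, "");

            // One worker container: 1 GB of memory and 1 virtual core,
            // preferably on a specific host (the hostname is hypothetical).
            Resource capability = Resource.newInstance(1024, 1);
            ContainerRequest workerAsk = new ContainerRequest(
                    capability,
                    new String[] { "worker-node-01.example.com" }, // preferred hosts
                    null,                                          // preferred racks
                    Priority.newInstance(0));
            rmClient.addContainerRequest(workerAsk);
            return rmClient;
        }
    }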

The ApplicationMaster is also responsible for the specific fault-tolerance behavior of the application. It receives status messages from the ResourceManager when its containers fail, and it can decide to take action based on these events (by asking the ResourceManager to create a new container) or to ignore them.
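
As a minimal sketch of one such policy (reusing the rmClient and workerAsk objects from the previous example), the heartbeat step below requests a replacement whenever a completed container reports a non-zero exit status:

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class AmFaultPolicy {
        // One heartbeat: report progress and replace any containers that failed.
        static void heartbeat(AMRMClient<ContainerRequest> rmClient,
                              ContainerRequest workerAsk,
                              float progress) throws Exception {
            AllocateResponse response = rmClient.allocate(progress);
            for (ContainerStatus status : response.getCompletedContainersStatuses()) {
                if (status.getExitStatus() != 0) {
                    // A non-zero exit status signals failure; this policy asks the
                    // ResourceManager for a replacement, but ignoring the event
                    // is an equally valid application-specific choice.
                    rmClient.addContainerRequest(workerAsk);
                }
            }
        }
    }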

YARN Application Startup

With these concepts in mind, let's see how applications conceptually work in YARN.

Application execution consists of the following steps:

  • Application submission.
  • Bootstrapping the ApplicationMaster instance for the application.
  • Application execution managed by the ApplicationMaster instance.

[Figure: Hadoop YARN application startup sequence]

Let’s walk through an application execution sequence (steps are illustrated in the diagram):

  • A client program submits the application, including the necessary specifications to launch the application-specific ApplicationMaster itself.
  • The ResourceManager takes responsibility for negotiating a container in which to start the ApplicationMaster, and then launches it.
  • The ApplicationMaster, on boot-up, registers with the ResourceManager; the registration allows the client program to query the ResourceManager for details it can use to communicate directly with its own ApplicationMaster.
  • During normal operation the ApplicationMaster negotiates appropriate resource containers via the resource-request protocol.
  • On successful container allocations, the ApplicationMaster launches the container by providing the container launch specification to the NodeManager. The launch specification typically includes the necessary information to allow the container to communicate with the ApplicationMaster itself (a launch sketch follows this list).
  • The application code executing within the container then provides necessary information (progress, status etc.) to its ApplicationMaster via an application-specific protocol.
  • During the application execution, the client that submitted the program communicates directly with the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.
  • Once the application is complete, and all necessary work has been finished, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
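
The sketch below ties the launch and shutdown steps together, using Hadoop's NMClient alongside the AMRMClient from the earlier examples; the /bin/date command and the helper-method shape are illustrative assumptions:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;

    public class AmLaunchAndFinish {
        // Launch each granted container via its NodeManager, then deregister.
        static void launchAndFinish(AMRMClient<ContainerRequest> rmClient,
                                    AllocateResponse response,
                                    Configuration conf) throws Exception {
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(conf);
            nmClient.start();

            for (Container container : response.getAllocatedContainers()) {
                // The launch context carries the command to run; "/bin/date"
                // is a placeholder for real application code.
                ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                        Collections.emptyMap(), Collections.emptyMap(),
                        Collections.singletonList("/bin/date"),
                        null, null, null);
                nmClient.startContainer(container, ctx);
            }

            // Deregistering lets the ResourceManager reclaim the AM's own container.
            rmClient.unregisterApplicationMaster(
                    FinalApplicationStatus.SUCCEEDED, "all work complete", "");
        }
    }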

YARN Application Lifespan

The lifespan of a YARN application can vary dramatically: from a short-lived application of a few seconds to a long-running application that runs for days or even months. Rather than look at how long the application runs for, it’s useful to categorize applications in terms of how they map to the jobs that users run. The simplest case is one application per user job, which is the approach that MapReduce takes.

The second model is to run one application per workflow or user session of (possibly unrelated) jobs. This approach can be more efficient than the first, since containers can be reused between jobs, and there is also the potential to cache intermediate data between jobs. Spark is an example that uses this model.

The third model is a long-running application that is shared by different users. Such an application often acts in some kind of coordination role. For example, Apache Slider has a long-running application master for launching other applications on the cluster. This approach is also used by Impala (see SQL-on-Hadoop Alternatives) to provide a proxy application that the Impala daemons communicate with to request cluster resources. The “always on” application master means that users have very low-latency responses to their queries since the overhead of starting a new application master is avoided.