HADOOP RACK AWARENESS


Large Hadoop clusters are usually configured across multiple racks. Communication between two DataNodes on the same rack is more efficient than communication between two nodes on different racks. In large Hadoop clusters, to reduce network traffic while reading and writing HDFS files, the NameNode chooses DataNodes that are on the same rack as, or a nearby rack to, the node issuing the read/write request.

The NameNode obtains this rack information by maintaining the rack ID of each DataNode. This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop. A default Hadoop installation assumes all nodes belong to the same rack.

HDFS block placement uses rack awareness for fault tolerance by placing one block replica on a different rack. This preserves data availability in the event of a network switch failure or a partition within the cluster.


Replica placement via Rack Awareness in Hadoop

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
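The notion of nodes being "closer" or "farther" can be made concrete with a distance measure over the topology tree. The sketch below is an illustration of that idea in bash, not Hadoop's actual Java implementation: each hop up or down the rack/node tree counts as one, so with a single datacenter root the distance is 0 for the same node, 2 for two nodes in the same rack, and 4 for nodes in different racks.

```shell
# Hedged sketch of HDFS-style network distance; topology paths such as
# /rack_01/node-a are assumed for illustration.
distance() {
  # $1 and $2 are topology paths of the form /<rack>/<node>
  if [ "$1" = "$2" ]; then echo 0; return; fi
  local r1=${1%/*} r2=${2%/*}   # strip the node component, keep the rack
  if [ "$r1" = "$r2" ]; then echo 2; else echo 4; fi
}

distance /rack_01/node-a /rack_01/node-a   # prints 0 (same node)
distance /rack_01/node-a /rack_01/node-b   # prints 2 (same rack)
distance /rack_01/node-a /rack_02/node-c   # prints 4 (different racks)
```

The NameNode prefers replica locations that minimise this distance when serving reads.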

On a multi-rack cluster, block replicas are placed according to a policy that puts no more than one replica on any single node and no more than two replicas on any single rack, with the constraint that the number of racks used for a block's replicas is always less than the total number of replicas.
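The two constraints above can be checked mechanically. The following is a small illustrative validator in bash (hypothetical "rack:node" labels, not anything HDFS ships): it rejects a placement if any node holds two replicas of the block or any rack holds more than two.

```shell
# Hedged sketch validating the placement constraints described above.
# Each replica location is written as "rack:node".
valid_placement() {
  local replicas="$*"
  # Constraint 1: no node may hold two replicas of the same block
  [ "$(echo $replicas | tr ' ' '\n' | sort | uniq -d | wc -l)" -eq 0 ] || return 1
  # Constraint 2: no rack may hold more than two replicas
  local max_per_rack
  max_per_rack=$(echo $replicas | tr ' ' '\n' | cut -d: -f1 \
                 | sort | uniq -c | awk '{print $1}' | sort -n | tail -1)
  [ "$max_per_rack" -le 2 ]
}

valid_placement r1:n1 r1:n2 r2:n3 && echo OK         # typical RF=3 layout
valid_placement r1:n1 r1:n2 r1:n3 || echo violation  # three replicas on one rack
```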

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks.

For example,

  • When a new block is created, the first replica is placed on the local node, the second on a node in a different rack, and the third on a different node in the local rack.
  • When re-replicating a block, if the number of existing replicas is one, place the second replica on a different rack.
  • If the number of existing replicas is two and both are on the same rack, place the third replica on a different rack.
  • For reads, the NameNode first checks whether the client's machine is located in the cluster. If so, block locations are returned from the DataNodes closest to the client.

This policy improves write performance without compromising data reliability or read performance.


CONFIGURING RACK AWARENESS IN HADOOP

We know that Hadoop replicates data into multiple file blocks and stores them on different machines. If Rack Awareness is not configured, Hadoop may place all copies of a block in the same rack, which results in data loss when that rack fails.

Although rack failure is rarer than node failure, this risk can be avoided by explicitly configuring Rack Awareness in core-site.xml.

Rack awareness is configured using the property "net.topology.script.file.name" in core-site.xml. If this property is not configured, /default-rack is returned for every IP address, i.e., all nodes are placed on the same rack.

Use the following instructions to configure rack awareness

1. Create a Rack Topology Script

Topology scripts are used by Hadoop to determine the rack location of nodes. Create a topology script and data file.
File name: rack-topology.sh

#!/bin/bash

# Add/adjust the property "net.topology.script.file.name"
# in core-site.xml with the absolute path to this file.
# Ensure the file is executable.

# Supply an appropriate rack prefix
RACK_PREFIX=default

# To test, supply a hostname as script input:
if [ $# -gt 0 ]; then

  CTL_FILE=${CTL_FILE:-"rack_topology.data"}
  HADOOP_CONF=${HADOOP_CONF:-"/etc/hadoop/conf"}

  # Fall back to the default rack if the data file is missing
  if [ ! -f "${HADOOP_CONF}/${CTL_FILE}" ]; then
    echo -n "/$RACK_PREFIX/rack "
    exit 0
  fi

  # Look up each supplied host/IP in the data file
  while [ $# -gt 0 ]; do
    nodeArg=$1
    exec < "${HADOOP_CONF}/${CTL_FILE}"
    result=""
    while read -r line; do
      ar=( $line )
      if [ "${ar[0]}" = "$nodeArg" ]; then
        result="${ar[1]}"
      fi
    done
    shift
    if [ -z "$result" ]; then
      echo -n "/$RACK_PREFIX/rack "
    else
      echo -n "/$RACK_PREFIX/rack_$result "
    fi
  done

else
  echo -n "/$RACK_PREFIX/rack "
fi

Sample Topology Data File
File name: rack_topology.data

# This file should be placed in the /etc/hadoop/conf directory:
#   - on the NameNode (and any backups, e.g. HA/failover NameNodes)
#   - on the JobTracker or ResourceManager (and any failover JTs/RMs)

# Add hostnames or IP addresses to this file, one per line,
# in the format: <host-or-ip> <rack-number>
192.168.2.10 01
192.168.2.11 02
192.168.2.12 03

Copy both of these files to the /etc/hadoop/conf directory on all cluster nodes, then run the rack-topology.sh script to test it.
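A quick way to sanity-check the lookup behaviour before deploying is to reproduce it against a local copy of the data file. The snippet below is a minimal stand-in for the script's lookup loop (using awk for brevity, with made-up IP addresses), not the script itself:

```shell
# Create a local copy of the data file (sample entries, for illustration)
cat > rack_topology.data <<'EOF'
192.168.2.10 01
192.168.2.11 02
EOF

# Minimal stand-in for the lookup the topology script performs
lookup() {
  local result
  result=$(awk -v host="$1" '$1 == host { print $2 }' rack_topology.data)
  if [ -z "$result" ]; then
    echo "/default/rack"            # unknown host falls back to the default
  else
    echo "/default/rack_$result"
  fi
}

lookup 192.168.2.10   # prints /default/rack_01
lookup 192.168.2.99   # prints /default/rack (unknown host)
```

The fallback matters: any host missing from the data file silently lands on /default/rack, so keep the file complete and synchronized across nodes.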

2. Add Properties to core-site.xml

Stop HDFS. Add the following property to core-site.xml:

<property>
	<name>net.topology.script.file.name</name>
	<value>/etc/hadoop/conf/rack-topology.sh</value>
</property>

3. Restart HDFS and MapReduce

Restart HDFS and MapReduce. After the services have started, verify that rack awareness has been activated. You can use the hadoop fsck and hdfs dfsadmin -report commands to verify.
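In practice the verification might look like the following (run on the cluster itself; the exact output shape varies by Hadoop version):

```
hdfs dfsadmin -printTopology        # lists each rack and the DataNodes under it
hdfs fsck / -files -blocks -racks   # shows the rack path of each block replica
```

If the topology script is working, -printTopology should show your rack labels (e.g. /default/rack_01) instead of a single /default-rack entry.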


Advantages of implementing rack awareness

  • Rack awareness enables Hadoop to maximize network bandwidth by favouring block transfers within a rack over transfers between racks. Especially with rack awareness, YARN is able to optimize MapReduce job performance by assigning tasks to nodes that are 'closer' to their data in terms of network topology. This is particularly beneficial in cases where tasks cannot be assigned to nodes where their data is stored locally.
  • By default, during HDFS write operations, the namenode assigns the 2nd and 3rd block replicas to nodes in a different rack from the first replica. This provides data protection even against rack failure; however this is possible only if Hadoop was configured with knowledge of its rack configuration.
  • Data is never lost entirely when a whole rack fails.

Hence, to get maximum performance, it is important to configure Hadoop so that it knows the topology of your network. Network locations such as hosts and racks are represented in a tree, which reflects the network "distance" between locations. HDFS uses these network locations to place block replicas more intelligently, trading off performance and resilience.