Apache Hadoop 3 - Features and Enhancements

Apache Hadoop 3.0.0

Hadoop 3 release dates

Hadoop 3.0.0 has been released! Apache Hadoop 3.x incorporates a number of significant enhancements over the previous major release line (hadoop-2.x). After four alpha releases and one beta release, 3.0.0 became generally available on 13 December 2017. The complete list of changes is in the official release notes.

The 3.0.x release series will continue to be maintained, and 3.1.0 is planned for the first half of 2018. Hadoop 3.0.0 can be downloaded from the official Apache Hadoop site.

3.0.0 GA: 2017-12-13

What’s New in Hadoop 3.0?

1. Minimum required Java version increased from Java 7 to Java 8

Oracle JDK 7 reached the end of its public updates in 2015, so all Hadoop 3 JARs are compiled to target a Java 8 runtime. Users who are still on Java 7 or below have to upgrade to Java 8 before they start working with Hadoop 3.

2. Support for erasure coding in HDFS

Erasure Coding

Erasure coding is similar to an advanced RAID technique: it recovers data automatically when a hard disk fails. Erasure Coding (EC) can be used in place of replication and provides a comparable level of fault tolerance with much less storage overhead.

With support for erasure coding in Hadoop 3.0, physical disk usage can be cut in half: the 3x disk space consumed by default replication drops to about 1.5x. With a Reed-Solomon layout of six data blocks and three parity blocks, a block group survives the loss of any three blocks, whereas 3x replication survives the loss of two replicas, so the fault tolerance level increases by 50%. This can save Hadoop users a great deal on hardware infrastructure: they can halve the size of their Hadoop cluster and store the same amount of data, or keep the current cluster hardware and store twice as much data with HDFS Erasure Coding.
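The 3x versus 1.5x figures can be checked with a little arithmetic. The sketch below assumes a Reed-Solomon layout with six data blocks and three parity blocks; the function names are made up for illustration and are not part of any Hadoop API.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(replicas)


def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per byte of user data for a (data, parity) EC layout."""
    return (data_blocks + parity_blocks) / data_blocks


def replication_failures_tolerated(replicas: int) -> int:
    """All but one replica can be lost before the data is gone."""
    return replicas - 1


def erasure_coding_failures_tolerated(parity_blocks: int) -> int:
    """A Reed-Solomon block group survives the loss of up to `parity_blocks` blocks."""
    return parity_blocks


# Assumed layouts: 3x replication versus six data blocks plus three parity blocks.
rep_overhead = replication_overhead(3)
ec_overhead = erasure_coding_overhead(6, 3)

print(f"replication: {rep_overhead}x storage, "
      f"tolerates {replication_failures_tolerated(3)} lost replicas")
print(f"erasure coding: {ec_overhead}x storage, "
      f"tolerates {erasure_coding_failures_tolerated(3)} lost blocks")
```

Under these assumptions, 1 TB of user data needs 3 TB of raw disk with replication and 1.5 TB with erasure coding.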

Considering trends in data growth and datacenter hardware, HDFS Erasure Coding is a major new feature and an important one.

3. YARN Timeline Service v.2

YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

Version 1 is limited to a single instance of writer/reader and storage, and does not scale well beyond small clusters. Version 2 uses a more scalable distributed writer architecture and a scalable backend storage. YARN Timeline Service v.2 separates the collection (writes) of data from the serving (reads) of data. It uses distributed collectors, essentially one collector for each YARN application. The readers are separate instances that are dedicated to serving queries via a REST API.

Usability improvements
In many cases, users are interested in information at the level of “flows” or logical groups of YARN applications. It is much more common to launch a set or series of YARN applications to complete a logical application. Timeline Service v.2 supports the notion of flows explicitly. In addition, it supports aggregating metrics at the flow level.
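To make the idea of flow-level aggregation concrete, here is a minimal sketch. It does not use the Timeline Service API; the record layout, flow names, application ids and metric names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-application records: (flow name, application id, metrics).
app_records = [
    ("daily_etl", "application_0001", {"cpu_ms": 1200, "bytes_read": 500}),
    ("daily_etl", "application_0002", {"cpu_ms": 800, "bytes_read": 300}),
    ("ad_hoc_report", "application_0003", {"cpu_ms": 150, "bytes_read": 40}),
]


def aggregate_by_flow(records):
    """Sum every metric over all applications that belong to the same flow."""
    totals = defaultdict(lambda: defaultdict(int))
    for flow, _app_id, metrics in records:
        for name, value in metrics.items():
            totals[flow][name] += value
    # Convert nested defaultdicts to plain dicts for easier printing and comparison.
    return {flow: dict(metrics) for flow, metrics in totals.items()}


flow_totals = aggregate_by_flow(app_records)
for flow, metrics in flow_totals.items():
    print(flow, metrics)
```

The two applications of the hypothetical daily_etl flow are reported as one logical unit, which is the view that flows are meant to provide.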

4. Shell script rewrite

Much of Apache Hadoop’s functionality is controlled via the shell. The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features.

5. Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

Hadoop 3 therefore adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. hadoop-client-api is used at compile scope and hadoop-client-runtime at runtime scope; the latter contains the relocated third-party dependencies from hadoop-client. You can bundle these dependencies into one jar and test the whole jar for version conflicts, which avoids leaking Hadoop’s dependencies onto the application’s classpath.

6. Support for Opportunistic Containers and Distributed Scheduling

A notion of ExecutionType has been introduced, whereby applications can now request containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at a NodeManager (NM) even if no resources are available at the moment of scheduling. In that case, the containers are queued at the NM and wait for resources to become available before they start. Opportunistic containers have lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.
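The queueing and preemption behaviour described above can be sketched with a toy model. This is not YARN code: ToyNodeManager, its abstract resource units and the choice to preempt the most recently started Opportunistic container first are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

GUARANTEED = "GUARANTEED"
OPPORTUNISTIC = "OPPORTUNISTIC"


@dataclass
class ToyNodeManager:
    capacity: int                                  # abstract resource units
    running: list = field(default_factory=list)    # (name, exec_type, size)
    queued: deque = field(default_factory=deque)   # waiting Opportunistic containers
    preempted: list = field(default_factory=list)  # names of preempted containers

    def used(self) -> int:
        return sum(size for _, _, size in self.running)

    def submit(self, name: str, exec_type: str, size: int) -> None:
        if exec_type == OPPORTUNISTIC:
            # Opportunistic containers start if they fit, otherwise they wait.
            if self.used() + size <= self.capacity:
                self.running.append((name, exec_type, size))
            else:
                self.queued.append((name, exec_type, size))
            return
        # Guaranteed containers preempt Opportunistic ones until they fit.
        while self.used() + size > self.capacity:
            victim = next(
                (c for c in reversed(self.running) if c[1] == OPPORTUNISTIC), None
            )
            if victim is None:
                raise RuntimeError("not enough capacity for a Guaranteed container")
            self.running.remove(victim)
            self.preempted.append(victim[0])
        self.running.append((name, exec_type, size))

    def finish(self, name: str) -> None:
        self.running = [c for c in self.running if c[0] != name]
        # Start queued containers in FIFO order while the next one still fits.
        while self.queued and self.used() + self.queued[0][2] <= self.capacity:
            self.running.append(self.queued.popleft())


nm = ToyNodeManager(capacity=4)
nm.submit("g1", GUARANTEED, 2)
nm.submit("o1", OPPORTUNISTIC, 2)   # fits, so it starts immediately
nm.submit("o2", OPPORTUNISTIC, 1)   # node is full, so it is queued
nm.submit("g2", GUARANTEED, 2)      # preempts o1 to make room
nm.finish("g1")                     # frees capacity, so o2 starts

print("running:", [name for name, _, _ in nm.running])
print("preempted:", nm.preempted)
```

In this toy run the node never sits idle while work is waiting, which is the utilization benefit the feature aims for.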

7. MapReduce task-level native optimization

MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more. The basic idea is to add a NativeMapOutputCollector that handles the key-value pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code.

8. Support for more than 2 NameNodes

The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.
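The failure counts quoted above follow from simple quorum arithmetic. The sketch below assumes that edits need a strict majority of JournalNodes and that a single surviving NameNode is enough to keep the namespace served; the function names are hypothetical.

```python
def journal_failures_tolerated(journal_nodes: int) -> int:
    """A strict majority of JournalNodes must stay up to accept edits."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(namenodes: int) -> int:
    """At least one NameNode must survive to be, or become, active."""
    return namenodes - 1


def cluster_failures_tolerated(namenodes: int, journal_nodes: int) -> int:
    """Failures of either kind that the setup is guaranteed to survive."""
    return min(
        namenode_failures_tolerated(namenodes),
        journal_failures_tolerated(journal_nodes),
    )


# Classic HA layout: two NameNodes, three JournalNodes.
print(cluster_failures_tolerated(2, 3))
# Layout from the text: three NameNodes, five JournalNodes.
print(cluster_failures_tolerated(3, 5))
```

The same arithmetic shows why adding a third NameNode without also growing the JournalNode quorum from three to five would not raise the guaranteed tolerance.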

9. Default ports of multiple services have been changed

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to their port because another application was already using it. The conflicting defaults have therefore been moved out of the ephemeral range and given new port numbers.

New Ports:

  • NameNode ports: 50470 ➔ 9871, 50070 ➔ 9870, 8020 ➔ 9820
  • Secondary NameNode ports: 50091 ➔ 9869, 50090 ➔ 9868
  • DataNode ports: 50020 ➔ 9867, 50010 ➔ 9866, 50475 ➔ 9865, 50075 ➔ 9864
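When migrating scripts or firewall rules, the list above can be kept as a small lookup table. This is only a sketch: the mapping is copied from the 3.0.0 list in this article, the helper names are made up, and later releases may use different defaults, so check the documentation for the version you run.

```python
# Linux ephemeral port range quoted in the text, 32768-61000 inclusive.
EPHEMERAL_RANGE = range(32768, 61001)

# Old default port -> new default port, as listed for Hadoop 3.0.0.
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
}


def new_default_port(old_port: int):
    """Return the 3.0.0 default for an old default port, or None if unchanged."""
    for mapping in PORT_CHANGES.values():
        if old_port in mapping:
            return mapping[old_port]
    return None


def in_ephemeral_range(port: int) -> bool:
    return port in EPHEMERAL_RANGE


for service, mapping in PORT_CHANGES.items():
    for old, new in mapping.items():
        print(f"{service}: {old} -> {new} "
              f"(old in ephemeral range: {in_ephemeral_range(old)})")
```

For example, a script that still points at the old NameNode web UI port 50070 would look up new_default_port(50070) and get 9870.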

10. Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors

Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

11. Intra-datanode balancer

A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI.
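To see what intra-DataNode skew looks like in numbers, consider the sketch below. It is not the algorithm used by hdfs diskbalancer; it only assumes that a balanced DataNode has the same used-to-capacity ratio on every disk, and the disk names and sizes are invented.

```python
def plan_moves(disks):
    """Return how many GB each disk should give up (positive) or receive (negative).

    `disks` maps a disk name to (used_gb, capacity_gb). The target for each disk
    is its share of the total used space, proportional to its capacity.
    """
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(capacity for _, capacity in disks.values())
    plan = {}
    for name, (used, capacity) in disks.items():
        target = total_used * capacity // total_capacity  # integer GB
        plan[name] = used - target
    return plan


# Hypothetical DataNode: two nearly full disks and one freshly added empty disk.
datanode_disks = {
    "disk1": (900, 1000),
    "disk2": (900, 1000),
    "disk3": (0, 1000),
}

moves = plan_moves(datanode_disks)
for name, delta in moves.items():
    action = "move off" if delta > 0 else "receive"
    print(f"{name}: {action} {abs(delta)} GB")
```

Here the new disk starts empty while the old ones sit at 90%, which is exactly the case the inter-DataNode balancer does not address.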

12. Reworked daemon and task heap management

Hadoop 3 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated.

Hadoop 3 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change.
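The task heap behaviour can be sketched as follows. This is an illustration under assumptions, not Hadoop's implementation: it assumes the heap is derived from the task's container memory using a ratio of 0.8 (believed to be the default of mapreduce.job.heap.memory-mb.ratio) and that an explicit -Xmx in the Java options always wins. The function names are hypothetical.

```python
import re

# Assumed default ratio between task heap and container memory.
HEAP_RATIO = 0.8


def derived_heap_mb(container_mb: int, ratio: float = HEAP_RATIO) -> int:
    """Heap size inferred from the container memory when no -Xmx is given."""
    return int(container_mb * ratio)


def effective_heap_mb(container_mb: int, java_opts: str = "") -> int:
    """Use an explicit -Xmx if present, otherwise derive the heap size."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if match:
        value = int(match.group(1))
        return value * 1024 if match.group(2) in "gG" else value
    return derived_heap_mb(container_mb)


# Only the container memory is set: the heap is derived from it.
print(effective_heap_mb(2048))
# Both are set, as in an existing config: the explicit -Xmx is kept.
print(effective_heap_mb(2048, "-Xmx1024m"))
```

The second call mirrors the statement that existing configs specifying both values are not affected by the change.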