Apache Hadoop Introduction

  • Developer(s): Apache Software Foundation
  • Written in: Java
  • Operating system: Cross-platform
  • License: Apache License 2.0

Apache Hadoop is an open-source software framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.


Hadoop was first conceived to fix a scalability issue in Nutch, an open-source crawler and search engine. At that time, Google had published papers on the Google File System (GFS) and MapReduce, a computational framework for parallel processing. Development started in the Apache Nutch project with a successful implementation of these papers. In January 2006, the work was moved out of Nutch into the new Hadoop subproject. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

Apache Hadoop

Hadoop is a distributed master-slave architecture that consists of the following primary components:

  • Hadoop Distributed File System (HDFS) for data storage.
  • Yet Another Resource Negotiator (YARN), a general purpose scheduler and resource manager.
  • MapReduce, a batch-based computational engine. MapReduce is implemented as a YARN application.


HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled after the Google File System (GFS) paper. HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O).
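The block-based layout described above can be sketched in a few lines of plain Python. This is an illustration of the splitting idea only, not the HDFS API; the block size and function name are assumptions for the demo (real HDFS defaults to 128 MB blocks).

```python
# Sketch: how an HDFS-style filesystem splits a file into large,
# fixed-size blocks. Names and sizes are illustrative, not the HDFS API.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split raw file bytes into fixed-size blocks (the last may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# HDFS defaults to 128 MB blocks; we use 10 bytes here to keep the demo small.
blocks = split_into_blocks(b"abcdefghijklmnopqrstuvwxy", 10)
print([len(b) for b in blocks])  # three blocks: 10, 10, and 5 bytes
```

Each block is then stored (and replicated) independently across DataNodes, which is what lets reads of a single large file be served in parallel.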

Scalability and availability are also key traits of HDFS, achieved in part due to data replication and fault tolerance. Hadoop 2 introduced two significant new features for HDFS—Federation and High Availability (HA):

  • Federation allows HDFS metadata to be shared across multiple NameNode hosts, which aids with HDFS scalability and also provides data isolation, allowing different applications or teams to run their own NameNodes without fear of impacting other NameNodes on the same cluster.
  • High Availability in HDFS removes the single point of failure that existed in Hadoop 1, wherein a NameNode disaster would result in a cluster outage. HDFS HA also offers the ability for failover (the process by which a standby NameNode takes over work from a failed primary NameNode) to be automated.
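The replication that underpins HDFS fault tolerance can be sketched as follows. This toy placement policy is a simplification of my own (plain round-robin, with invented node names); real HDFS uses a rack-aware policy, but the effect is the same: each block lives on several DataNodes, so losing one node loses no data.

```python
# Sketch: round-robin placement of block replicas across DataNodes,
# illustrating how replication (default factor 3) gives HDFS fault
# tolerance. The policy and node names are simplified assumptions,
# not the real rack-aware HDFS placement.

from itertools import islice, cycle

def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` DataNodes, cycling through the cluster."""
    placement = {}
    node_cycle = cycle(datanodes)
    for block in block_ids:
        placement[block] = list(islice(node_cycle, replication))
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(["blk_1", "blk_2"], nodes))
# blk_1 -> dn1, dn2, dn3; blk_2 -> dn4, dn1, dn2
```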


Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

YARN’s architecture is simple because its primary role is to schedule and manage resources in a Hadoop cluster. The core components in YARN are the ResourceManager and the NodeManager. YARN separates resource management from the processing components.

Cluster resource management means managing the resources of the Hadoop cluster, such as memory and CPU. YARN took over this task from MapReduce, so MapReduce is now streamlined to perform only data processing, which is what it does best.

YARN has a central ResourceManager component that manages resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, and they all share a common resource management layer.
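The central-allocation idea can be sketched with a toy scheduler. The class and method names below are illustrative assumptions, not the YARN API; the point is only that one component tracks a shared pool of cluster memory and grants or rejects container requests against it.

```python
# Sketch: a toy ResourceManager granting container requests from a shared
# pool of cluster memory, in the spirit of YARN's central scheduling.
# Names are illustrative, not the real YARN API.

class ToyResourceManager:
    def __init__(self, total_memory_mb: int):
        self.free_memory_mb = total_memory_mb
        self.allocations = {}  # app_id -> memory granted so far

    def allocate(self, app_id: str, memory_mb: int) -> bool:
        """Grant the request if the cluster has capacity, else reject it."""
        if memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        self.allocations[app_id] = self.allocations.get(app_id, 0) + memory_mb
        return True

    def release(self, app_id: str) -> None:
        """Return an application's memory to the shared pool."""
        self.free_memory_mb += self.allocations.pop(app_id, 0)

rm = ToyResourceManager(total_memory_mb=8192)
print(rm.allocate("mapreduce-job", 4096))  # True
print(rm.allocate("spark-app", 6144))      # False: only 4096 MB left
rm.release("mapreduce-job")
print(rm.allocate("spark-app", 6144))      # True
```

Because all applications negotiate with the same manager, a MapReduce job and any other YARN application draw from one shared pool instead of each framework managing its own slice of the cluster.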


MapReduce is a batch-based, distributed computing framework modeled after Google’s paper on MapReduce. It allows you to parallelize work over a large amount of raw data. The MapReduce model simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software. With this abstraction, MapReduce allows the programmer to focus on addressing business needs rather than getting tangled up in distributed system complications.
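The model described above can be illustrated with the classic word-count example in plain Python. A real MapReduce job distributes the map, shuffle, and reduce phases across a cluster and is usually written in Java against the Hadoop API; this local sketch only shows the shape of the programming model.

```python
# Sketch: the MapReduce programming model in plain Python -- word count
# with explicit map, shuffle (group by key), and reduce phases.

from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big clusters"])))
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

The programmer supplies only the map and reduce logic; the framework handles partitioning the input, shuffling intermediate pairs between machines, and retrying failed tasks.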

Hadoop distributions

Hadoop is an Apache open source project, and regular releases of the software are available for download directly from the Apache project’s website. You can either download and install Hadoop from there or use a commercial distribution of Hadoop, which gives you the added benefits of enterprise administration software and a support team to consult.


Apache is the organization that maintains the core Hadoop code and distribution. The challenge with the Apache distribution is that support is limited to the goodwill of the open source community, and there’s no guarantee that your issue will be investigated and fixed. Having said that, the Hadoop community is a very supportive one, and responses to problems are usually rapid.


CDH (Cloudera Distribution Including Apache Hadoop) is the most tenured Hadoop distribution, and it employs a large number of Hadoop (and Hadoop ecosystem) committers. Doug Cutting, who along with Mike Cafarella originally created Hadoop, is the chief architect at Cloudera. In aggregate, this means that bug fixes and feature requests have a better chance of being addressed in Cloudera compared to Hadoop distributions with fewer committers.


Hortonworks Data Platform (HDP) is also made up of a large number of Hadoop committers, and it offers the same advantages as Cloudera in terms of the ability to quickly address problems and feature requests in core Hadoop and its ecosystem projects. Hortonworks is also the main driver behind the next-generation YARN platform, which is a key strategic piece keeping Hadoop relevant.


MapR has fewer Hadoop committers on its team than the other distributions discussed here, so its ability to fix bugs and shape Hadoop’s future is potentially more limited than that of its peers.

Who uses Apache Hadoop

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produced data that was used in every Yahoo! web search query. In 2010, Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage. In June 2012, they announced the data had grown to 100 PB and later that year they announced that the data was growing by roughly half a PB per day.

As of 2016, Hadoop adoption had become widespread. A wide variety of companies and organizations use Hadoop for both research and production. You can check who uses Hadoop on the Hadoop PoweredBy wiki page.