Apache HBase Introduction

[Apache HBase logo]
Developer(s): Apache Software Foundation
Written in: Java
Operating system: Cross-platform
License: Apache License 2.0
Website: hbase.apache.org

Why HBase?

There are many use cases for which an RDBMS makes perfect sense, and for years the RDBMS was the go-to solution for data-storage problems. With the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system to store big data, and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in various forms: structured, semi-structured, or even unstructured.

Hadoop, however, performs only batch processing, and data is accessed only sequentially. That means the entire dataset must be scanned even for the simplest of jobs, and processing a huge dataset typically produces another huge dataset that must also be processed sequentially. Hadoop is therefore not good for record lookups, not good for updates, and not good for incremental addition of small batches. A new kind of solution was needed to access any point of data in a single unit of time (random access).

Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and support random access to it.

What is HBase?

HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. HBase is a real-time, open-source, column-oriented, distributed database written in Java. It is modelled after Google’s BigTable and represents a key-value, column-family store. It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce. It is built on top of Apache Hadoop and Apache ZooKeeper.
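The "key-value, column-family store" model can be pictured as a sorted, multi-dimensional map: row key → column → timestamped versions of a value. A toy sketch of that idea in plain Java (this is not the HBase API; the class and method names are invented for illustration):

```java
import java.util.TreeMap;

// Conceptual sketch of HBase's data model: a sorted map of
// rowKey -> (family:qualifier -> (timestamp -> value)).
public class DataModelSketch {
    // TreeMap keeps row keys sorted lexicographically, as HBase does.
    static TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

    static void put(String row, String column, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(ts, value);
    }

    // Reads return the newest version by default, like HBase.
    static String get(String row, String column) {
        return table.get(row).get(column).lastEntry().getValue();
    }

    public static void main(String[] args) {
        put("row1", "info:name", 1L, "old name");
        put("row1", "info:name", 2L, "new name");
        System.out.println(get("row1", "info:name")); // newest version wins
    }
}
```

Note how nothing forces two rows to have the same columns: the schema is per-row, which is what "column-family store" buys you over a fixed relational schema.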

HBase on HDFS

Apache HBase runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed storage of the Hadoop Distributed File System (HDFS) and benefit from Hadoop’s MapReduce programming model. It is meant to host large tables with billions of rows and potentially millions of columns, running across a cluster of commodity hardware. HBase lets you query for individual records as well as derive aggregate analytic reports across massive amounts of data.

HBase Features

Some of the important features of Apache HBase are:

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
  • Automatic failover support between RegionServers.
  • HBase supports HDFS out of the box as its distributed file system.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy to use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Extensible JRuby-based (JIRB) shell.
  • Replication across data centers.
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

We will try to understand these features in depth later on.
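For a first feel of the JRuby-based shell listed above, a typical session looks roughly like this (the table and column-family names are made up for illustration; `create`, `put`, `get`, and `scan` are the shell's basic verbs):

```
hbase(main):001:0> create 'users', 'info'
hbase(main):002:0> put 'users', 'row1', 'info:name', 'Alice'
hbase(main):003:0> get 'users', 'row1'
hbase(main):004:0> scan 'users'
```

`create` takes a table name and one or more column families; `put` writes a single cell; `get` fetches one row; `scan` iterates over the table.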

What is Sharding?
Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
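The simplest form of sharding to illustrate is hash partitioning: each row key is routed to one of N shards. A toy sketch in plain Java (the shard count and key names are invented for illustration; HBase itself shards by sorted row-key ranges called regions, not by hash, but the goal of spreading one large table across many servers is the same):

```java
// Toy illustration of sharding: route each row key to one of N shards.
public class ShardingSketch {
    static int shardFor(String rowKey, int numShards) {
        // Math.floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(rowKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        String[] keys = {"user-001", "user-002", "user-003"};
        for (String k : keys) {
            System.out.println(k + " -> shard " + shardFor(k, 4));
        }
    }
}
```

The important property is that the routing is deterministic: the same key always lands on the same shard, so a lookup touches one server instead of the whole cluster.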

When to use HBase?

HBase is useful only if:

  • You have a high volume of data to store.
  • You want high scalability.
  • You can live without the extra features an RDBMS provides, like typed columns, secondary indexes, transactions, advanced query languages, etc.
  • You have lots of versioned data and you want to store all of it.
  • You want column-oriented data.

When Not to use HBase?

HBase is not useful if:

  • You only have a few thousand or a few million rows. A traditional RDBMS might be a better choice, because all of your data might wind up on one or two nodes while the rest of the cluster sits idle.
  • You cannot live without RDBMS features and commands.
  • You have fewer than five DataNodes (given the default replication factor of 3).

HBase History

  • 2006: Google releases its paper on BigTable.
  • 2006 (end of year): HBase development starts.
  • 2008: Hadoop becomes an Apache top-level project and HBase becomes its subproject.
  • 2008: HBase 0.18 and 0.19 released.
  • 2009: HBase 0.20.0 released.
  • 2010: HBase becomes an Apache top-level project.
  • 2011: HBase 0.92 released.
  • 2015: HBase 1.0 released.
  • 2016: HBase 1.2.x series released.

Difference between HBase and RDBMS

HBase                                                    | RDBMS
-------------------------------------------------------- | --------------------------------------------------------
Column-oriented                                          | Row-oriented (mostly)
Flexible schema; columns can be added on the fly         | Fixed schema
Designed to store denormalized data                      | Designed to store normalized data
Good with sparse tables                                  | Not optimized for sparse tables
Joins via MapReduce, which is not optimized              | Optimized for joins
Tight integration with MapReduce                         | No integration with MapReduce
Horizontal scalability: just add hardware                | Hard to shard and scale
Good for semi-structured as well as structured data      | Good for structured data

Difference between HBase and HDFS

HDFS:

  • Is a distributed file system that is well suited for the storage of large files. It does not provide fast individual record lookups in files.
  • You would typically store files that are in the 100s of MB upwards on HDFS and access them through MapReduce to process them in batch mode.
  • HDFS files are write-once; it follows a write-once, read-many model.
  • Is suited for high-latency batch processing.
  • Is designed for batch processing and hence has no concept of random reads/writes.
  • Data is primarily accessed through MapReduce.

HBase:

  • Is a database that stores its data in a distributed file system.
  • Is built on top of HDFS and provides fast record lookups (and updates) for large tables. HBase internally stores its data in indexed "StoreFiles" on HDFS for high-speed lookups.
  • Is built for low-latency operations.
  • Provides access to single rows from among billions of records.
  • Data is accessed through shell commands, client APIs in Java, REST, Avro, or Thrift.
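As a taste of the Java client API mentioned above, here is a minimal put-and-get sketch. It assumes the `hbase-client` dependency on the classpath, an `hbase-site.xml` pointing at a running cluster, and a pre-created `users` table with an `info` column family; the table, row, and column names are illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster/ZooKeeper settings.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the single row we just wrote.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```

Note that everything in HBase is byte arrays; the `Bytes` utility class handles the conversions, and the try-with-resources block ensures the connection and table handles are closed.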