| Developer(s) | Apache Software Foundation |
|---|---|
| Written in | Java |
| Operating system | Cross-platform |
| License | Apache License 2.0 |
| Website | hbase.apache.org |
There are many use cases for which the RDBMS makes perfect sense, and for decades the RDBMS has been the go-to solution for data storage problems. But today we live in an era in which we are all connected over the Internet and expect to find results instantaneously. Since the advent of big data, companies have focused on delivering more targeted information, such as recommendations or online ads, and their ability to do so directly influences their success as a business. Systems like Hadoop now enable them to gather and process petabytes of data. Hadoop uses a distributed file system to store big data and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in arbitrary, semi-structured, or even unstructured formats.
Hadoop is a framework for storing, processing, and managing large amounts of data, and its strengths and weaknesses follow directly from its design.

Hadoop is good for batch processing and for scans over big files: it accesses data sequentially, and processing one huge dataset in Hadoop usually produces another huge dataset. Out of the box, Hadoop can handle a high volume of multi-structured data. But it cannot handle a high velocity of random reads and writes, and it is unable to change a file without completely rewriting it. Hence it is not good for record lookup, not good for updates, and not good for incremental addition of small batches.
The solution to this problem is the Log-Structured Merge Tree (LSM-tree), the data structure on which HBase is based. HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. By running on the Hadoop Distributed File System (HDFS), it combines the scalability of Hadoop with real-time data access as a key/value store and the deep analytic capabilities of MapReduce. HBase allows fast random reads and writes, although it is read-optimized.
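To make the random read/write pattern concrete, here is a minimal sketch using the HBase Java client API (the `Connection`/`Table` interface of recent client versions); the table name `users`, column family `info`, and row key `user42` are illustrative assumptions, and the table is assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up cluster addresses from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: the Put is routed by row key straight to the
            // region (and RegionServer) responsible for that key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Random read: a Get fetches a single row by key, no scan needed.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```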
Similar to HBase, Cassandra, CouchDB, Dynamo, and MongoDB are databases that store huge amounts of data and access it in a random manner.
HBase is a real-time, open source, column-oriented, distributed database written in Java. It is modelled after Google's Bigtable and represents a key/value, column-family store. It is built on top of Apache Hadoop and Apache ZooKeeper.
Apache HBase runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed storage of the Hadoop Distributed File System (HDFS) and benefit from Hadoop's MapReduce programming model. It is meant to host large tables with billions of rows and potentially millions of columns, running across a cluster of commodity hardware. HBase allows you to query for individual records as well as derive aggregate analytic reports across a massive amount of data.
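As a sketch of the second access pattern, a `Scan` walks a contiguous range of row keys and can feed a simple client-side aggregate; the `pageviews` table, the `stats:count` column, and the date-prefixed row keys are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("pageviews"))) {
            // Rows are stored sorted by key, so a range scan touches only
            // the regions holding this slice of the keyspace.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("2014-01-01"))
                .withStopRow(Bytes.toBytes("2014-02-01"))
                .setCaching(500); // fetch rows in batches to cut round trips
            long total = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] v = row.getValue(Bytes.toBytes("stats"),
                                            Bytes.toBytes("count"));
                    if (v != null) total += Bytes.toLong(v);
                }
            }
            System.out.println("Total views in January: " + total);
        }
    }
}
```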
HBase's data model is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by HDFS and is the part of the Hadoop ecosystem that provides random, real-time read/write access to data stored there. Data can be stored in HDFS either directly or through HBase, and a data consumer can then read and access it randomly through HBase, which sits on top of the file system and provides read and write access.
Some of the important features of Apache HBase are:

- Linear and modular scalability across clusters of commodity hardware
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers
- Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
- An easy-to-use Java API for client access
- A block cache and Bloom filters for real-time queries

We will try to understand these features in depth later on.
HBase is useful only if:

- You have hundreds of millions or billions of rows; a few million rows is usually better served by a traditional RDBMS
- You can live without the extra features an RDBMS provides, such as typed columns, secondary indexes, transactions, and advanced query languages
- You have enough commodity hardware, since HDFS does not perform well with fewer than about five DataNodes

HBase is not useful if:

- Your dataset is only a few million rows or less
- Your application depends heavily on RDBMS features such as joins, transactions, or secondary indexes
- You cannot run a cluster of sufficient size; a single node is fine for development but not for production workloads
Google was faced with a challenging problem: how could it provide timely search results across the entire Internet? The answer was that it essentially needed to cache the Internet and define a new way to search that enormous cache quickly.
In 2003, Google published a paper titled “The Google File System”. This scalable distributed file system, abbreviated as GFS, uses a cluster of commodity hardware to store huge amounts of data. The filesystem handled data replication between nodes so that losing a storage server would have no effect on data availability.
Shortly afterward, Google published another paper, titled “MapReduce: Simplified Data Processing on Large Clusters”. MapReduce is a programming model and an associated implementation for processing and generating large data sets. MapReduce made use of the vast number of CPUs each commodity server in the GFS cluster provides. MapReduce plus GFS forms the backbone for processing massive amounts of data, including the entire search index Google owns.
Because of the shortcomings of RDBMSes at large scale, Google defined a new, simple API with basic create, read, update, and delete (CRUD) operations, plus a scan function to iterate over larger key ranges or entire tables. This work was published in 2006 in a paper titled “Bigtable: A Distributed Storage System for Structured Data”. Bigtable is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
It was not long after Google published these papers that open source implementations of them began to appear, and in 2007, Mike Cafarella released code for an open source Bigtable implementation that he called HBase. In 2008, Hadoop became an Apache top-level project and HBase became one of its subprojects, with HBase 0.18 and 0.19 released in October 2008. In 2010, HBase itself became an Apache top-level project. HBase 0.92 was released in 2011, and the latest release as of writing this article is 0.96, which you can download from hbase.apache.org.
Whenever you have to retain information about anything, you typically want to use some storage backend providing a persistence layer for your application. This works well for a limited number of records, but with the dramatic increase in the amount of data being retained nowadays, some of the architectural implementation details of common database systems show signs of weakness.
Relational databases perform transactional updates very well, particularly the difficult work of maintaining consistency during updates. But they apply much of the overhead required for complex update operations to every activity, and that can handicap them for other functions. Relational databases struggle with the efficiency of certain operations key to Big Data management: they don't scale well to very large sizes, and although grid solutions can help, the creation of new clusters on the grid is not dynamic, so large data solutions become very expensive on relational databases. They don't do unstructured data search very well, nor do they handle data in unexpected formats well.
Despite the maturity of relational database products and the dramatic growth in computer power over the past decade, we still hear about projects that fail because the performance of the relational database used is just not good enough. Usually this is because of the way relational databases physically store data: to assemble the data they need, developers often have to JOIN one table to another, to another, to another. To retrieve the data, the database runs optimization routines to determine the best way to gather it and then retrieves it. This process often takes a long time and can negatively impact performance. While relational database optimizers have improved over time, they still carry overhead significantly greater than that of a simple key-based lookup.
Big Data organizations such as Yahoo!, Google, Facebook, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data they were dealing with, hence the development of the Hadoop file system, the MapReduce programming model, and associated databases such as Cassandra and HBase. One of the key capabilities of a Hadoop-style environment is the ability to dynamically, or at least easily, expand the number of servers being used for data storage. The cost of storing large amounts of data in a relational database gets very expensive: cost grows geometrically with the amount of data to be stored, reaching a limit in the petabyte range. The cost of storing data in a Hadoop solution grows linearly with the volume of data, and there is no ultimate limit.
Still, a lot of companies are using RDBMSes successfully as part of their technology stack. Facebook, for example, has a very large MySQL setup, and for its purposes it works sufficiently well. This database farm suits the given business goal and may not be replaced anytime soon.
HBase is a type of "NoSQL" database. NoSQL, often read as "Not Only SQL", is an approach to data management and database design that is useful for very large sets of distributed data. For decades, the relational database (RDBMS) has been the dominant model for database management; today, non-relational, "cloud", or "NoSQL" databases are gaining interest as an alternative model.
NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.
NoSQL databases are sometimes referred to as cloud databases, non-relational databases, Big Data databases and a myriad of other terms and were developed in response to the sheer volume of data being generated, stored and analyzed by modern day applications.
The reasons for businesses to adopt a NoSQL database environment over a relational database have almost everything to do with Big Data, which is one of the key forces driving the growth and popularity of NoSQL. The almost limitless array of data collection technologies, ranging from simple online actions to point-of-sale systems to GPS tools to smartphones, tablets, and sophisticated sensors, acts as a force multiplier for data growth.
Some of the advantages of NoSQL are described below.
NoSQL databases usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application even to be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down it can be quickly and transparently replaced with no application disruption.
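In HBase specifically, sharding takes the form of regions: contiguous row-key ranges that are spread across RegionServers and split automatically as they grow. A table can also be pre-split at creation time; a minimal sketch using the HBase 2.x Admin API, where the `events` table, the `d` column family, and the split points are illustrative assumptions:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("events"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                .build();
            // Pre-split into four regions at these row-key boundaries; HBase
            // will keep splitting regions automatically as they fill up.
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```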
Cloud computing makes this significantly easier, with providers such as Amazon Web Services providing virtually unlimited capacity on demand, and taking care of all the necessary infrastructure administration tasks. Developers no longer need to construct complex, expensive platforms to support their applications, and can concentrate on writing application code.
Note: For a full list of NoSQL databases, see http://nosql-database.org/.
| HBase | RDBMS |
|---|---|
| Column-oriented | Row-oriented (mostly) |
| Flexible schema; columns can be added on the fly (see the sketch after this table) | Fixed schema |
| Designed to store denormalized data | Designed to store normalized data |
| Good with sparse tables | Not optimized for sparse tables |
| No native joins; join-like work falls to MapReduce and is not optimized | Optimized for joins |
| Tight integration with MapReduce | No integration with MapReduce |
| Horizontal scalability: just add hardware | Hard to shard and scale |
| Good for semi-structured as well as structured data | Good for structured data |
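To make the flexible-schema row of the table above concrete: HBase fixes only the column families at table-creation time, while column qualifiers inside a family are created simply by writing to them. A minimal sketch, again with illustrative table, family, and qualifier names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FlexibleSchema {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Only the column family "info" must exist. The qualifiers
            // "nickname" and "twitter_handle" are created simply by writing
            // them; other rows need not have these columns at all, which is
            // why sparse tables cost nothing in HBase.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("nickname"),
                          Bytes.toBytes("ali"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("twitter_handle"),
                          Bytes.toBytes("@alice"));
            table.put(put);
        }
    }
}
```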
Finally, it is worth contrasting HBase with the file system it runs on:

| HDFS | HBase |
|---|---|
| A distributed file system, well suited for storing large files | A database built on top of HDFS |
| Optimized for sequential scans over large files; no fast individual record lookup | Provides fast lookups of individual records in large tables |
| Files are written once; no random writes or in-place updates | Supports fast random reads and writes; records can be updated |
| High-latency batch processing via MapReduce | Low-latency access to single rows out of billions of records |