| Developer(s) | Apache Software Foundation |
|---|---|
| Written in | Java |
| Operating system | Cross-platform |
| License | Apache License 2.0 |
| Website | hbase.apache.org |
There are many use cases for which the RDBMS makes perfect sense, and for decades the RDBMS has been the go-to solution for data storage problems. But today we live in an era in which we are all connected over the Internet and expect to find results instantaneously. Since the advent of big data, companies have focused on delivering more targeted information, such as recommendations or online ads, and their ability to do so directly influences their success as a business. Systems like Hadoop now enable them to gather and process petabytes of data. Hadoop uses a distributed file system to store big data and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in arbitrary, semi-structured, or even unstructured formats.
Hadoop is a framework for storing, processing, and managing large amounts of data, and its strengths and weaknesses follow directly from its design.

Hadoop is good for batch processing and for scans over big files: it accesses data sequentially, and processing one huge dataset in Hadoop usually produces another huge dataset. Out of the box, Hadoop can handle a high volume of multi-structured data. But it cannot handle a high velocity of random reads and writes, and it is unable to change a file without completely rewriting it. Hence it is not good for record lookup, not good for updates, and not good for incremental addition of small batches.
The solution to this problem is the Log-Structured Merge Tree (LSM-tree), the data structure on which HBase is based. HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. By running on the Hadoop Distributed File System (HDFS), it combines the scalability of Hadoop with real-time data access as a key/value store and the deep analytic capabilities of MapReduce. HBase allows fast random reads and writes, although it is read-optimized.
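To make the random read/write pattern concrete, here is a minimal sketch using the HBase Java client API (the `Connection`/`Table` interface of recent client versions); the table name `users`, column family `info`, and row key `user42` are illustrative assumptions, and the table is assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up cluster addresses from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: the Put is routed by row key straight to the
            // region (and RegionServer) responsible for that key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Random read: a Get fetches a single row by key, no scan needed.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```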
Similar to HBase, Cassandra, CouchDB, Dynamo, and MongoDB are databases that store huge amounts of data and access it in a random manner.
HBase is a real-time, open source, column-oriented, distributed database written in Java. It is modelled after Google's Bigtable and represents a key/value, column-family store. It is built on top of Apache Hadoop and Apache ZooKeeper.
Apache HBase runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed storage of the Hadoop Distributed File System (HDFS) and benefit from Hadoop's MapReduce programming model. It is meant to host large tables with billions of rows and potentially millions of columns, running across a cluster of commodity hardware. HBase allows you to query for individual records as well as derive aggregate analytic reports across a massive amount of data.
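As a sketch of the second access pattern, a `Scan` walks a contiguous range of row keys and can feed a simple client-side aggregate; the `pageviews` table, the `stats:count` column, and the date-prefixed row keys are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("pageviews"))) {
            // Rows are stored sorted by key, so a range scan touches only
            // the regions holding this slice of the keyspace.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("2014-01-01"))
                .withStopRow(Bytes.toBytes("2014-02-01"))
                .setCaching(500); // fetch rows in batches to cut round trips
            long total = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] v = row.getValue(Bytes.toBytes("stats"),
                                            Bytes.toBytes("count"));
                    if (v != null) total += Bytes.toLong(v);
                }
            }
            System.out.println("Total views in January: " + total);
        }
    }
}
```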
HBase's data model is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by HDFS and is the part of the Hadoop ecosystem that provides random, real-time read/write access to data stored there. Data can be stored in HDFS either directly or through HBase, and a data consumer can then read and access it randomly through HBase, which sits on top of the file system and provides read and write access.
Some of the important features of Apache HBase are:

- Linear and modular scalability across clusters of commodity hardware
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers
- Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
- An easy-to-use Java API for client access
- A block cache and Bloom filters for real-time queries

We will try to understand these features in depth later on.
HBase is useful only if:

- You have hundreds of millions or billions of rows; a few million rows is usually better served by a traditional RDBMS
- You can live without the extra features an RDBMS provides, such as typed columns, secondary indexes, transactions, and advanced query languages
- You have enough commodity hardware, since HDFS does not perform well with fewer than about five DataNodes

HBase is not useful if:

- Your dataset is only a few million rows or less
- Your application depends heavily on RDBMS features such as joins, transactions, or secondary indexes
- You cannot run a cluster of sufficient size; a single node is fine for development but not for production workloads
Google was faced with a challenging problem: how could it provide timely search results across the entire Internet? The answer was that it essentially needed to cache the Internet and define a new way to search that enormous cache quickly.
In 2003, Google published a paper titled “The Google File System”. This scalable distributed file system, abbreviated as GFS, uses a cluster of commodity hardware to store huge amounts of data. The filesystem handled data replication between nodes so that losing a storage server would have no effect on data availability.
Shortly afterward, Google published another paper, titled “MapReduce: Simplified Data Processing on Large Clusters”. MapReduce is a programming model and an associated implementation for processing and generating large data sets. MapReduce made use of the vast number of CPUs each commodity server in the GFS cluster provides. MapReduce plus GFS forms the backbone for processing massive amounts of data, including the entire search index Google owns.
Because of the shortcomings of RDBMSes at large scale, Google defined a new, simple API with basic create, read, update, and delete (CRUD) operations, plus a scan function to iterate over larger key ranges or entire tables. This work was published in 2006 in a paper titled “Bigtable: A Distributed Storage System for Structured Data”. Bigtable is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
It was not long after Google published these papers that open source implementations of them began to appear, and in 2007, Mike Cafarella released code for an open source Bigtable implementation that he called HBase. In 2008, Hadoop became an Apache top-level project and HBase became one of its subprojects, with HBase 0.18 and 0.19 released in October 2008. In 2010, HBase itself became an Apache top-level project. HBase 0.92 was released in 2011, and the latest release as of writing this article is 0.96, which you can download from hbase.apache.org.
Whenever you have to retain information about anything, you typically want to use some storage backend providing a persistence layer for your application. This works well for a limited number of records, but with the dramatic increase in the amount of data being retained nowadays, some of the architectural implementation details of common database systems show signs of weakness.
Relational databases perform transactional updates very well, particularly the difficult work of maintaining consistency during updates. But they apply much of the overhead required for complex update operations to every activity, and that can handicap them for other functions. Relational databases struggle with the efficiency of certain operations key to Big Data management: they don't scale well to very large sizes, and although grid solutions can help, the creation of new clusters on the grid is not dynamic, so large data solutions become very expensive on relational databases. They don't do unstructured data search very well, nor do they handle data in unexpected formats well.
Despite the maturity of relational database products and the dramatic growth in computer power over the past decade, we still hear about projects that fail because the performance of the relational database used is just not good enough. Usually this is because of the way relational databases physically store data: to assemble the data they need, developers often have to JOIN one table to another, to another, to another. To retrieve the data, the database runs optimization routines to determine the best way to gather it and then retrieves it. This process often takes a long time and can negatively impact performance. While relational database optimizers have improved over time, they still carry overhead significantly greater than that of a simple key-based lookup.
Big Data organizations such as Yahoo!, Google, Facebook, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data they were dealing with, hence the development of the Hadoop file system, the MapReduce programming model, and associated databases such as Cassandra and HBase. One of the key capabilities of a Hadoop-style environment is the ability to dynamically, or at least easily, expand the number of servers being used for data storage. The cost of storing large amounts of data in a relational database gets very expensive: cost grows geometrically with the amount of data to be stored, reaching a limit in the petabyte range. The cost of storing data in a Hadoop solution grows linearly with the volume of data, and there is no ultimate limit.
Still, a lot of companies are using RDBMSes successfully as part of their technology stack. Facebook, for example, has a very large MySQL setup, and for its purposes it works sufficiently well. This database farm suits the given business goal and may not be replaced anytime soon.
HBase is a type of "NoSQL" database. NoSQL, often read as "Not Only SQL", is an approach to data management and database design that is useful for very large sets of distributed data. For decades, the relational database (RDBMS) has been the dominant model for database management; today, non-relational, "cloud", or "NoSQL" databases are gaining interest as an alternative model.
NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.
NoSQL databases are sometimes referred to as cloud databases, non-relational databases, Big Data databases and a myriad of other terms and were developed in response to the sheer volume of data being generated, stored and analyzed by modern day applications.
The reasons for businesses to adopt a NoSQL database environment over a relational database have almost everything to do with Big Data, which is one of the key forces driving the growth and popularity of NoSQL. The almost limitless array of data collection technologies, ranging from simple online actions to point-of-sale systems to GPS tools to smartphones, tablets, and sophisticated sensors, acts as a force multiplier for data growth.
Some of the advantages of NoSQL are described below.
NoSQL databases usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application even to be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down it can be quickly and transparently replaced with no application disruption.
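In HBase specifically, sharding takes the form of regions: contiguous row-key ranges that are spread across RegionServers and split automatically as they grow. A table can also be pre-split at creation time; a minimal sketch using the HBase 2.x Admin API, where the `events` table, the `d` column family, and the split points are illustrative assumptions:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("events"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                .build();
            // Pre-split into four regions at these row-key boundaries; HBase
            // will keep splitting regions automatically as they fill up.
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```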
Cloud computing makes this significantly easier, with providers such as Amazon Web Services providing virtually unlimited capacity on demand, and taking care of all the necessary infrastructure administration tasks. Developers no longer need to construct complex, expensive platforms to support their applications, and can concentrate on writing application code.
Note: For a full list of NoSQL databases, see http://nosql-database.org/.
| HBase | RDBMS |
|---|---|
| Column-oriented | Row-oriented (mostly) |
| Flexible schema; columns can be added on the fly (see the sketch after this table) | Fixed schema |
| Designed to store denormalized data | Designed to store normalized data |
| Good with sparse tables | Not optimized for sparse tables |
| No native joins; join-like work falls to MapReduce and is not optimized | Optimized for joins |
| Tight integration with MapReduce | No integration with MapReduce |
| Horizontal scalability: just add hardware | Hard to shard and scale |
| Good for semi-structured as well as structured data | Good for structured data |
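To make the flexible-schema row of the table above concrete: HBase fixes only the column families at table-creation time, while column qualifiers inside a family are created simply by writing to them. A minimal sketch, again with illustrative table, family, and qualifier names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FlexibleSchema {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Only the column family "info" must exist. The qualifiers
            // "nickname" and "twitter_handle" are created simply by writing
            // them; other rows need not have these columns at all, which is
            // why sparse tables cost nothing in HBase.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("nickname"),
                          Bytes.toBytes("ali"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("twitter_handle"),
                          Bytes.toBytes("@alice"));
            table.put(put);
        }
    }
}
```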
Finally, it is worth contrasting HBase with the file system it runs on:

| HDFS | HBase |
|---|---|
| A distributed file system, well suited for storing large files | A database built on top of HDFS |
| Optimized for sequential scans over large files; no fast individual record lookup | Provides fast lookups of individual records in large tables |
| Files are written once; no random writes or in-place updates | Supports fast random reads and writes; records can be updated |
| High-latency batch processing via MapReduce | Low-latency access to single rows out of billions of records |