Apache HBase Architecture

HBase is a distributed database, meaning it is designed to run on a cluster of anywhere from a few to possibly thousands of servers. As a result, it is more complicated to install. All the typical problems of distributed computing come into play, such as coordination and management of remote processes, locking, data distribution, network latency, and the number of round trips between servers. Fortunately, HBase makes use of several other mature technologies, such as Apache Hadoop and Apache ZooKeeper, to solve many of these issues.

HBase is a column-oriented data store, meaning it stores data by columns rather than by rows. If there is no data for a given column family, HBase simply does not store anything at all; contrast this with a relational database, which must store null values explicitly. In addition, when retrieving data from HBase, you should ask only for the specific column families you need; because there can literally be millions of columns in a given row, you need to make sure you ask only for the data you actually need.
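
As a rough sketch of what that looks like with the HBase Java client (the table name "users", column family "personal" and row keys here are made up for illustration), a Get or Scan can be restricted to just the families you actually need:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SelectiveRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Fetch a single row, but only the "personal" column family,
                // instead of pulling every column the row might contain.
                Get get = new Get(Bytes.toBytes("row-1"));
                get.addFamily(Bytes.toBytes("personal"));
                Result result = table.get(get);
                byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
                System.out.println("name = " + (name == null ? "<none>" : Bytes.toString(name)));

                // The same idea applies to scans: restrict them to the needed family.
                Scan scan = new Scan();
                scan.addFamily(Bytes.toBytes("personal"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }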

[Figure: HBase architecture diagram]

Like HDFS, HBase follows the traditional master-slave model, where a master makes the decisions and one or more slaves do the real work. In HBase, the master is called HMaster and the slaves are called HRegionServers.

Apache HBase Architectural Components

As you can see from the diagram above, an HBase cluster typically has one Master node, called HMaster, and multiple Region Servers, called HRegionServers. Each Region Server contains multiple Regions, called HRegions. HBase also uses ZooKeeper as a distributed coordination service to maintain server state in the cluster.

Data in HBase is stored in Tables and these Tables are stored in Regions. When a Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across the cluster. Each Region Server hosts roughly the same number of Regions.

Let's look at each component in more detail.

HBase HMaster

HMaster is the implementation of the Master Server. Region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the NameNode.

[Figure: HBase HMaster (image: www.mapred.com)]

The HMaster in HBase is responsible for:

  • Performing administration
  • Managing and monitoring the cluster
  • Coordinating the region servers by
    • Assigning regions on startup, and re-assigning regions for recovery or load balancing
    • Monitoring all RegionServer instances in the cluster (it listens for notifications from ZooKeeper)
  • Controlling load balancing and failover
  • Acting as the interface for creating, deleting and updating tables
  • Interacting with the external world (the HMaster web UI, clients, Region Servers and other management utilities such as JConsole)
  • Using ZooKeeper to keep track of certain events and happenings in the cluster. In the Master, a centralized class called ZookeeperWatcher acts as a proxy for any event tracker that uses ZooKeeper. All the common concerns such as connection handling, node management and exceptions are handled here. Any tracker that needs the services of this class must register with it to get notified of any specific event.

Chores done by HMaster
  • Log Cleaner Chore: HMaster runs a chore to delete the HLogs in the oldlogs directory.
  • HFile Cleaner Chore: There is also a chore, run at specified intervals, which handles the HFile cleaning functions inside the master.
  • Balancer Chore: The balancer periodically redistributes regions across the Region Servers so that each server hosts roughly the same number of regions.
  • Catalog Janitor Chore: A janitor for the catalog tables. It scans the META table looking for unused regions to garbage collect.
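
Because table DDL goes through the Master, the usual way to exercise it from application code is the Java client's Admin interface. A minimal sketch, assuming a hypothetical table "users" with one column family "personal" (this uses the older HTableDescriptor-style API, which is still available in current releases):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class TableDdl {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {

                TableName name = TableName.valueOf("users");

                // Create a table with one column family. The request is served
                // by the HMaster, which also assigns the new table's regions.
                HTableDescriptor desc = new HTableDescriptor(name);
                desc.addFamily(new HColumnDescriptor("personal"));
                admin.createTable(desc);

                // Deleting a table is also a Master operation; the table must
                // be disabled first.
                admin.disableTable(name);
                admin.deleteTable(name);
            }
        }
    }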

HBase RegionServer

HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.

HBase Tables are divided horizontally by row key range into Regions. A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called Region Servers, and these serve data for reads and writes. A region server can serve about 1,000 regions.
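
To see this row-key partitioning for yourself, the Java client's RegionLocator can list every region of a table together with its start key, end key and hosting Region Server. A small sketch (the table name "users" is again just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ListRegions {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 RegionLocator locator = connection.getRegionLocator(TableName.valueOf("users"))) {

                // Each region covers the rows from its start key (inclusive)
                // to its end key (exclusive) and is hosted by one Region Server.
                for (HRegionLocation location : locator.getAllRegionLocations()) {
                    System.out.println(
                            "region="    + location.getRegionInfo().getRegionNameAsString()
                          + " startKey=" + Bytes.toStringBinary(location.getRegionInfo().getStartKey())
                          + " endKey="   + Bytes.toStringBinary(location.getRegionInfo().getEndKey())
                          + " server="   + location.getServerName());
                }
            }
        }
    }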

The HRegionServer performs the following tasks:

  • Hosting and managing Regions
  • Splitting Regions automatically
  • Handling read and write requests for all the regions under it
  • Communicating with clients directly
  • Running a variety of background threads that:
    • Check for splits and handle minor compactions
    • Check for major compactions
    • Periodically flush in-memory writes in the MemStore to StoreFiles
    • Periodically check the RegionServer’s WAL

[Figure: HBase Region Server (image: www.mapred.com)]

A Region Server runs on an HDFS data node and has the following components:

  • WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
  • BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.
  • MemStore: is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.
  • HFiles: HFiles store the rows as sorted KeyValues on disk.

When a Region Server (RS) receives a write request, it directs the request to a specific Region. Each Region stores a set of rows. Row data can be separated into multiple column families (CFs). Data of a particular CF is stored in an HStore, which consists of a MemStore and a set of HFiles.

Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families (explained below). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Servers is kept in a system table called .META. When trying to read or write data from HBase, the clients read the required Region information from the .META. table and communicate directly with the appropriate Region Server.
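
From the application's point of view this lookup is transparent: the Java client consults the catalog table (.META., known as hbase:meta in newer versions), caches the region locations, and then talks to the right Region Server on its own. A minimal sketch with made-up table, family and row names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutAndGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // The client locates the region that owns "row-1" via the meta
                // table and sends the Put straight to that Region Server.
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Reads work the same way: no explicit region handling needed.
                Get get = new Get(Bytes.toBytes("row-1"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
            }
        }
    }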

Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore for the affected Store. This ensures that HBase has durable writes. Without the WAL, there is the possibility of data loss if a RegionServer fails before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer.
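
This write-first-to-WAL behaviour is the default. The Java client lets you relax it per mutation through the Durability setting; a minimal sketch, with a made-up column family, of a Put that deliberately skips the WAL:

    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalDurability {
        // Builds a Put that bypasses the WAL; such an edit can be lost if the
        // Region Server crashes before its MemStore is flushed to an HFile.
        static Put unsafeFastPut(String row, String value) {
            Put put = new Put(Bytes.toBytes(row));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes(value));
            put.setDurability(Durability.SKIP_WAL);   // default is Durability.USE_DEFAULT (WAL on)
            return put;
        }
    }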

HRegions

Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family. Regions are nothing but tables that are split up and spread across the region servers. The hierarchy of objects is as follows:

Table	(HBase table)
  Region	(Regions for the table)
    Store	(Store per ColumnFamily for each Region for the table)
      MemStore	(MemStore for each Store for each Region for the table)
      StoreFile	(StoreFiles for each Store for each Region for the table)
        Block	(Blocks within a StoreFile within a Store for each Region for the table)

Region Split

Initially there is one region per table. When a region grows too large (larger than hbase.hregion.max.filesize), it splits into two child regions. The child regions, each representing one half of the original region, are opened in parallel on the same Region Server, and then the split is reported to the HMaster.

For load balancing reasons, the HMaster may schedule the new regions to be moved to other servers. This results in the new Region Server serving data from a remote HDFS node until a major compaction moves the data files to the Region Server’s local node. HBase data is local when it is written, but when a region is moved (for load balancing or recovery), it is not local until a major compaction.
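
The split threshold can be tuned, and you can also pre-split a table or request a split yourself through the Admin API. A hedged sketch (the table name, column family and split keys are illustrative, and hbase.hregion.max.filesize is normally set in hbase-site.xml rather than in code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SplitExamples {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Example only: allow regions to grow to 20 GB before an automatic split.
            conf.setLong("hbase.hregion.max.filesize", 20L * 1024 * 1024 * 1024);

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {

                // Pre-split a new table into four regions using three split keys,
                // so writes are spread across Region Servers from the start.
                HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
                desc.addFamily(new HColumnDescriptor("d"));
                byte[][] splitKeys = {
                    Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
                };
                admin.createTable(desc, splitKeys);

                // Explicitly ask for the existing table's regions to be split.
                admin.split(TableName.valueOf("events"));
            }
        }
    }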

HBase Minor and Major Compaction

When something is written to HBase, it is first written to an in-memory store, called the MemStore. When the MemStore reaches a certain size, it is flushed to disk into a StoreFile. The store files that are created on disk are immutable. Compaction is a process where store files are merged together.

Compaction comes in two flavors: major and minor. Minor compactions will usually pick up a couple of the smaller adjacent StoreFiles and rewrite them as one. Minor compactions do not drop deletes or expired cells; only major compactions do this. You can tune the number of HFiles to compact and the frequency of a minor compaction. Minor compactions are important because without them, reading a particular row can require many disk reads and cause slow overall performance.

Sometimes a minor compaction will pick up all the StoreFiles in the Store and write them to a single StoreFile; in this case it actually promotes itself to a major compaction. On large systems, major compactions usually have to be scheduled or run manually.
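
When you do want to run one by hand, the Admin API can flush a table's MemStores and then request a major compaction. A sketch with an illustrative table name (the compaction itself runs asynchronously on the Region Servers):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {

                TableName users = TableName.valueOf("users");

                // Optionally flush in-memory MemStore data to StoreFiles first.
                admin.flush(users);

                // Request a major compaction: all StoreFiles of each Store are
                // rewritten into one, and deleted/expired cells are dropped.
                admin.majorCompact(users);
            }
        }
    }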

What is Write Ahead Log (WAL) ?

The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. Under normal operation the WAL is not needed, because data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed. If writing to the WAL fails, the entire operation to modify the data fails. Usually, there is only one instance of a WAL per RegionServer. The RegionServer records Puts and Deletes to it before recording them to the MemStore for the affected store. The WAL resides in HDFS in the /hbase/WALs/ directory (prior to HBase 0.94, WALs were stored in /hbase/.logs/), with a subdirectory per region server.

With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be written sequentially. This can make the WAL a performance bottleneck. HBase 1.0 introduced support for MultiWAL. MultiWAL allows a RegionServer to write multiple WAL streams in parallel by using multiple pipelines in the underlying HDFS instance, which increases total throughput during writes. This parallelization is done by partitioning incoming edits by their Region.
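
Enabling MultiWAL is a configuration change on the Region Servers: per the HBase documentation, the hbase.wal.provider property is set to multiwal, normally in hbase-site.xml. A small sketch showing the same property set programmatically, purely for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class MultiWalConfig {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // "multiwal" selects the WAL provider that partitions edits by region
            // into several WAL streams per Region Server instead of one.
            // In practice this goes into hbase-site.xml on the Region Servers.
            conf.set("hbase.wal.provider", "multiwal");

            System.out.println("hbase.wal.provider = " + conf.get("hbase.wal.provider"));
        }
    }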

HLog: The class which implements the WAL is called HLog.

Store

A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.

  • MemStore
    The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When asked to flush, the current MemStore is moved to a snapshot and a new MemStore is started. HBase continues to serve edits out of the new MemStore and the backing snapshot until the flusher reports that the flush succeeded; at that point the snapshot is discarded.
  • StoreFile (HFile)
    StoreFiles are where your data lives.
  • Blocks
    StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. Compression happens at the block level within StoreFiles.

HBase MemStore

  • The MemStore stores updates in memory as sorted KeyValues, the same as it would be stored in an HFile.
  • It is a write buffer where HBase accumulates data in memory before a permanent write.
  • There is one MemStore per column family. The updates are sorted per column family.
  • When the MemStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS.
  • As HFiles are immutable, it doesn't write to an existing HFile but instead forms a new file on every flush.
  • The size of the MemStore can be changed by setting hbase.hregion.memstore.flush.size in hbase-site.xml.
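
As a small, hedged illustration of that last point (the 256 MB figure is arbitrary), the flush threshold is an ordinary configuration property, which could also be set on a Configuration object in code for testing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class MemStoreFlushSize {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // Example only: flush a region's MemStore once it reaches 256 MB.
            // Normally this is set in hbase-site.xml on the Region Servers.
            conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);

            System.out.println("hbase.hregion.memstore.flush.size = "
                    + conf.get("hbase.hregion.memstore.flush.size"));
        }
    }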

HBase HFile

Data is stored in an HFile, which contains sorted key/values. When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write, and it is very fast, as it avoids moving the disk drive head. Key/value pairs are stored in increasing key order in the HFile.

HBase and ZooKeeper

HBase comes integrated with ZooKeeper. When you start HBase, a ZooKeeper instance is started as well. The reason is that ZooKeeper helps keep track of all the region servers that exist for HBase: how many region servers there are and which of them are currently alive and serving data. It tracks the small pieces of cluster state that Hadoop itself does not manage, which keeps that coordination overhead off Hadoop. Hence the HMaster gets the details of the region servers by contacting ZooKeeper.

You can also run HBase without the inbuilt ZooKeeper. To point HBase at an existing ZooKeeper cluster, one that is not managed by HBase, set HBASE_MANAGES_ZK in conf/hbase-env.sh to false, and then set the ensemble locations and client port, if non-standard, in hbase-site.xml.
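
On the client side, the same ensemble information is what the Java API uses to find the cluster. A minimal sketch, assuming a ZooKeeper ensemble on the hypothetical hosts zk1, zk2 and zk3 and the default client port:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class ExternalZkClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // The ZooKeeper ensemble the HBase cluster registers itself with.
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
            // Only needed if the ensemble uses a non-standard client port.
            conf.set("hbase.zookeeper.property.clientPort", "2181");

            try (Connection connection = ConnectionFactory.createConnection(conf)) {
                System.out.println("Connected to HBase via ZooKeeper quorum: "
                        + conf.get("hbase.zookeeper.quorum"));
            }
        }
    }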