Apache HBase Installation

Installing HBase is straightforward: download the most recent release of HBase from the Apache HBase release page and unpack the contents into a suitable directory. Before installing, let's look at the different HBase run modes, because the installation steps depend on the mode you choose.

HBase run modes

HBase has two run modes:

  • Standalone
  • Distributed

Out of the box, HBase runs in standalone mode. To set up a distributed deployment, you need to configure HBase by editing files in the HBase conf directory.

Standalone mode

This is the default mode. In standalone mode, HBase does not use HDFS; it uses the local filesystem instead, and it runs all HBase daemons and a local ZooKeeper in the same JVM. ZooKeeper binds to a well-known port so that clients can talk to HBase.

Distributed mode

Distributed modes require an instance of the Hadoop Distributed File System (HDFS). Distributed mode can be subdivided into:

  • Pseudo-distributed: distributed, but all daemons run on a single node.
  • Fully-distributed: the daemons are spread across all nodes in the cluster.

Requirements

Java

HBase requires Java version 6 or later. Make sure the java binary is executable and can be found on your path. Enter java -version on the command line and verify that it prints a version number of 1.6 or later. To install and set up Java, see: Installing Java
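
The Java check described above can be scripted. The following is a minimal sketch, not an official HBase tool: the function name java_major_version is made up for this example, and it assumes that java -version prints a line containing version "..." as common JDKs do.

```shell
#!/bin/sh
# Extract the major Java version from the first line of `java -version` output.
# Handles the legacy "1.6.0_45" scheme (major = 6) and the newer "11.0.2" scheme.
# java_major_version is a hypothetical helper name used only in this sketch.
java_major_version() {
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$ver" in
        "")  return 1 ;;
        1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;
        *)   printf '%s\n' "$ver" | cut -d. -f1 | sed 's/[^0-9].*//' ;;
    esac
}

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, so redirect it before reading.
    first_line=$(java -version 2>&1 | head -n 1)
    major=$(java_major_version "$first_line") || major=0
    if [ "${major:-0}" -ge 6 ]; then
        echo "Java $major found: OK"
    else
        echo "Java is too old or its version could not be parsed: $first_line"
    fi
else
    echo "java not found on PATH"
fi
```

If the script reports that java is missing, fix the PATH before going any further, because the HBase start scripts depend on it.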

Hadoop

After installing Java, you have to install Hadoop. Running HBase in either distributed mode requires an instance of the Hadoop Distributed File System (HDFS). HBase is bound to work only with the specific version of Hadoop it was built against. One reason for this concerns the remote procedure call (RPC) API between HBase and Hadoop: the wire protocol is versioned and needs to match, and even small differences can break communication between them.

SSH

ssh must be installed and sshd must be running if you want to use the supplied scripts to manage remote Hadoop and HBase daemons. You must be able to ssh to all nodes, including your local node, using passwordless login.

Synchronized time

The clocks on cluster nodes should be in basic alignment. Even slight differences in time can cause unexplainable behavior. Run NTP on your cluster, or an equivalent application, to synchronize the time on all servers.
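
As a rough way to spot clock drift between two nodes, you can compare their epoch timestamps. This is only a sketch: clock_skew is a hypothetical helper, the remote call is shown as a comment because it assumes passwordless SSH to the other node, and SSH latency inflates the measured value slightly.

```shell
#!/bin/sh
# Absolute difference, in seconds, between two epoch timestamps.
# clock_skew is a hypothetical helper name used only in this sketch.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

local_now=$(date +%s)

# On a real cluster, assuming passwordless SSH to the remote node:
# remote_now=$(ssh node-b.corejavaguru.com date +%s)
# echo "skew: $(clock_skew "$local_now" "$remote_now")s"

# Self-comparison, so the sketch runs without a cluster.
echo "skew against self: $(clock_skew "$local_now" "$local_now")s"
```

A check like this only detects drift; keeping the clocks aligned is still the job of NTP or an equivalent service.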

Installation

Standalone HBase

Standalone HBase mode is not an appropriate mode for a production instance of HBase, but will allow you to experiment with HBase. Using HBase with a local filesystem does not guarantee durability. You need to run HBase on HDFS to ensure all writes are preserved. Running against the local filesystem is intended as a shortcut to get you familiar with how the general system works, as the very first phase of evaluation.

Installing standalone HBase is straightforward. Download the most recent release of HBase from the Apache HBase release page and unpack the contents into a suitable directory, such as /usr/local or /opt, like so:

$ cd /usr/local
$ tar -zxvf hbase-x.y.z.tar.gz

For HBase 0.98.5 and later, you are required to set the JAVA_HOME environment variable before starting HBase. You can set the variable via your operating system’s usual mechanism, but HBase provides a central mechanism, conf/hbase-env.sh. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system.
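
The JAVA_HOME edit can also be scripted. The sketch below assumes that conf/hbase-env.sh contains a (possibly commented-out) export JAVA_HOME= line; set_java_home is a hypothetical helper, and /usr/lib/jvm/default-java is only a placeholder for your actual JDK location. It runs against a temporary copy so that nothing real is modified.

```shell
#!/bin/sh
# Set JAVA_HOME in an hbase-env.sh style file.
# $1: path to the file, $2: JDK location (must not contain "|" or "&").
# set_java_home is a hypothetical helper name used only in this sketch.
set_java_home() {
    env_file=$1
    jdk_dir=$2
    tmp_file="$env_file.tmp.$$"
    if grep -q '^[# ]*export JAVA_HOME=' "$env_file"; then
        # Rewrite the existing line, uncommenting it if necessary.
        sed "s|^[# ]*export JAVA_HOME=.*|export JAVA_HOME=$jdk_dir|" "$env_file" > "$tmp_file" &&
            mv "$tmp_file" "$env_file"
    else
        # No such line yet: append one.
        echo "export JAVA_HOME=$jdk_dir" >> "$env_file"
    fi
}

# Demonstration on a throwaway file instead of the real conf/hbase-env.sh.
demo=$(mktemp)
printf '%s\n' '# The java implementation to use.' '# export JAVA_HOME=/usr/java/jdk1.6.0/' > "$demo"
set_java_home "$demo" /usr/lib/jvm/default-java   # placeholder path
cat "$demo"
rm -f "$demo"
```

To apply it for real, you would point the function at conf/hbase-env.sh inside your HBase directory and pass the JDK path for your system.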

Edit conf/hbase-site.xml, which is the main HBase configuration file. At this time, you only need to specify the directory on the local filesystem where HBase and ZooKeeper write data. The following configuration will store HBase’s data in the hbase directory, in the home directory of the user called testuser. Paste the <property> tags beneath the <configuration> tags, which should be empty in a new HBase install.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>

That's all; the HBase installation and configuration is complete. You can start HBase by using the start-hbase.sh script provided in the bin directory of HBase. If all goes well, a message is logged to standard output showing that HBase started successfully. You can use the jps command to verify that you have one running process called HMaster. In standalone mode, HBase runs all daemons within this single JVM, i.e. the HMaster, a single HRegionServer, and the ZooKeeper daemon.

$ cd /usr/local/hbase-x.y.z/bin
$ ./start-hbase.sh

In the same way that the start script is provided to conveniently start all HBase daemons, the bin/stop-hbase.sh script stops them.

$ ./bin/stop-hbase.sh
stopping hbase....................
$

Pseudo-Distributed Mode

After configuring standalone mode, you can reconfigure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process.

In pseudo-distributed mode, you can store your data in HDFS instead of the local filesystem, assuming you have HDFS available.

If you configured standalone HBase and it is still running, stop it. Pseudo-distributed mode creates an entirely new directory where HBase stores its data, so any databases you created before will be lost. Now edit the hbase-site.xml configuration and add the property below, which directs HBase to run in distributed mode, with one JVM instance per daemon.

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>

Next, change hbase.rootdir from the local filesystem to the address of your HDFS instance, using the hdfs:// URI syntax. In this example, HDFS is running on localhost at port 8020.

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>

Note: You do not need to create the directory in HDFS. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.

Use the bin/start-hbase.sh command to start HBase. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running. If everything worked correctly, HBase created its directory in HDFS. In the configuration above, it is stored in /hbase/ on HDFS. You can use the hadoop fs command in Hadoop’s bin/ directory to list this directory.

$ ./bin/hadoop fs -ls /hbase
Found 7 items
drwxr-xr-x   - hbase users          0 2015-11-21 11:18 /hbase/.tmp
drwxr-xr-x   - hbase users          0 2015-11-21 11:27 /hbase/WALs
drwxr-xr-x   - hbase users          0 2015-11-21 11:31 /hbase/corrupt
drwxr-xr-x   - hbase users          0 2015-11-21 12:44 /hbase/data
-rw-r--r--   3 hbase users         42 2015-11-21 11:41 /hbase/hbase.id
-rw-r--r--   3 hbase users          7 2015-11-21 11:41 /hbase/hbase.version
drwxr-xr-x   - hbase users          0 2015-11-21 12:18 /hbase/oldWALs
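
If you prefer a scripted check over reading the listing, hadoop fs -test -d reports through its exit status whether a directory exists. The sketch below assumes the hadoop command is on your PATH (the listing above calls it as ./bin/hadoop instead) and that hbase.rootdir points at /hbase; hdfs_dir_exists is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the given path exists as a directory in HDFS.
# hdfs_dir_exists is a hypothetical helper name used only in this sketch.
hdfs_dir_exists() {
    hadoop fs -test -d "$1"
}

if command -v hadoop >/dev/null 2>&1; then
    if hdfs_dir_exists /hbase; then
        echo "/hbase exists in HDFS"
    else
        echo "/hbase not found; check hbase.rootdir and the HBase logs"
    fi
else
    echo "hadoop command not found on PATH"
fi
```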

To stop HBase, use the bin/stop-hbase.sh command, in the same way as in standalone mode.

$ ./bin/stop-hbase.sh
stopping hbase....................
$

Fully-Distributed Mode

In real-world scenarios, you need a fully-distributed configuration to fully test HBase. In a distributed configuration, the cluster contains multiple nodes, each of which runs one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.

For the sake of explanation, let's add two more nodes to the pseudo-distributed installation above to make it fully distributed. In reality, the number of nodes ranges from a dozen to hundreds.

The example architecture will be as follows:

Node Name                 Master   ZooKeeper   RegionServer
node-a.corejavaguru.com   yes      yes         no
node-b.corejavaguru.com   backup   yes         yes
node-c.corejavaguru.com   no       yes         yes

Note: The steps below assume that each node is a virtual machine and that they are all on the same network. They build upon the pseudo-distributed installation, assuming that the system you configured in that procedure is now node-a. Stop HBase on node-a before continuing.

SSH Access

node-a needs to be able to log into node-b and node-c (and to itself) in order to start the daemons. The easiest way to accomplish this is to use the same username on all hosts, and configure password-less SSH login from node-a to each of the others.

  • On node-a, while logged in as the user who will run HBase, generate an SSH key pair using the following command:
    $ ssh-keygen -t rsa
    
  • On node-b and node-c, log in as the HBase user and create a .ssh/ directory in the user’s home directory, if it does not already exist.
  • Securely copy the public key from node-a to each of the nodes, using scp or some other secure means. On each of the other nodes, create a new file called .ssh/authorized_keys if it does not already exist, and append the contents of the id_rsa.pub file to the end of it. Note that you also need to do this for node-a itself.
    $ cat id_rsa.pub >> ~/.ssh/authorized_keys
    
  • If you performed the procedure correctly, you should not be prompted for a password when you SSH from node-a to either of the other nodes using the same username.
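
The last step can be checked without typing anything interactively. The sketch below relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password. The function name can_ssh_without_password is made up for this example, and the hosts to test are taken from the script's arguments.

```shell
#!/bin/sh
# Succeeds only if key-based login to the host works without any prompt.
# can_ssh_without_password is a hypothetical helper name used only in this sketch.
can_ssh_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Check every host passed on the command line; with no arguments nothing is checked.
for host in "$@"; do
    if can_ssh_without_password "$host"; then
        echo "$host: passwordless login OK"
    else
        echo "$host: passwordless login FAILED"
    fi
done
```

Saved as, say, check-ssh.sh (a hypothetical file name) on node-a, it could be run as sh check-ssh.sh node-a.corejavaguru.com node-b.corejavaguru.com node-c.corejavaguru.com.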

node-a configuration

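  • Edit conf/regionservers and remove the line which contains localhost. Add lines with the hostnames or IP addresses for node-b and node-c.
  • To use node-b as a backup master, create a new file called conf/backup-masters and add a line with the hostname for node-b.
  • To configure ZooKeeper, edit conf/hbase-site.xml and add the following properties.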
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>node-a.corejavaguru.com,node-b.corejavaguru.com,node-c.corejavaguru.com</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/usr/local/zookeeper</value>
    </property>
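
The conf/regionservers edit from the first step can be scripted as well. This is a small sketch: write_regionservers is a hypothetical helper, and the demonstration writes to a temporary directory rather than to a real HBase conf directory.

```shell
#!/bin/sh
# Write one RegionServer hostname per line, replacing any previous content
# (including the default "localhost" entry).
# $1: path to the regionservers file; remaining arguments: hostnames.
# write_regionservers is a hypothetical helper name used only in this sketch.
write_regionservers() {
    target=$1
    shift
    printf '%s\n' "$@" > "$target"
}

# Demonstration in a throwaway directory.
demo_dir=$(mktemp -d)
write_regionservers "$demo_dir/regionservers" node-b.corejavaguru.com node-c.corejavaguru.com
cat "$demo_dir/regionservers"
rm -rf "$demo_dir"
```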
    

node-b and node-c configuration

Each node of your cluster needs to have the same configuration information. Copy the contents of the conf/ directory on node-a to the conf/ directory on node-b and node-c.
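
One way to do the copy is a small loop around scp. The sketch below is an illustration under assumptions: sync_conf is a hypothetical helper, /usr/local/hbase stands in for your HBase directory (assumed to be the same path on every node), and by default the function only prints the commands it would run.

```shell
#!/bin/sh
# Print (dry run) or execute the copy of the local conf/ directory to each node.
# $1: HBase directory, assumed identical on every node.
# Remaining arguments: target hostnames. Set DRY_RUN=0 to really copy.
# sync_conf is a hypothetical helper name used only in this sketch.
sync_conf() {
    hbase_home=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scp -r $hbase_home/conf $host:$hbase_home/"
        else
            scp -r "$hbase_home/conf" "$host:$hbase_home/"
        fi
    done
}

# /usr/local/hbase is a placeholder path for this example.
sync_conf /usr/local/hbase node-b.corejavaguru.com node-c.corejavaguru.com
```

Review the printed commands first, then rerun with DRY_RUN=0 once passwordless SSH is working.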

Start HBase

On node-a, issue the start-hbase.sh command. Your output will be similar to that below.

$ bin/start-hbase.sh
node-c.corejavaguru.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-c.corejavaguru.com.out
node-a.corejavaguru.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-a.corejavaguru.com.out
node-b.corejavaguru.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-b.corejavaguru.com.out
starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-a.corejavaguru.com.out
node-c.corejavaguru.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-c.corejavaguru.com.out
node-b.corejavaguru.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-b.corejavaguru.com.out
node-b.corejavaguru.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-nodeb.corejavaguru.com.out

ZooKeeper starts first, followed by the master, then the RegionServers.

On each node of the cluster, run the jps command and verify that the correct processes are running on each server.

node-a jps Output

$ jps
20355 Jps
20071 HQuorumPeer
20137 HMaster

node-b jps Output

$ jps
15930 HRegionServer
16194 Jps
15838 HQuorumPeer

node-c jps Output

$ jps
13901 Jps
13639 HQuorumPeer
13737 HRegionServer
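
Comparing the jps output with the expected daemons by eye gets tedious on larger clusters. The following sketch lists any expected daemon that is absent from a jps listing; missing_daemons is a hypothetical helper, and the sample text is the node-c listing shown above.

```shell
#!/bin/sh
# Print each expected daemon that is absent from jps-style output.
# $1: jps output ("<pid> <name>" per line); remaining arguments: expected names.
# missing_daemons is a hypothetical helper name used only in this sketch.
missing_daemons() {
    jps_output=$1
    shift
    for daemon in "$@"; do
        if ! printf '%s\n' "$jps_output" |
            awk -v n="$daemon" '$2 == n { ok = 1 } END { exit !ok }'; then
            echo "$daemon"
        fi
    done
}

# Sample listing for node-c; on a live cluster you could capture it with
# something like: sample=$(ssh node-c.corejavaguru.com jps)
sample='13901 Jps
13639 HQuorumPeer
13737 HRegionServer'

missing=$(missing_daemons "$sample" HQuorumPeer HRegionServer)
if [ -z "$missing" ]; then
    echo "node-c: all expected daemons are running"
else
    echo "node-c: missing $missing"
fi
```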

Note: The HQuorumPeer process is a ZooKeeper instance which is controlled and started by HBase. If you use ZooKeeper this way, it is limited to one instance per cluster node and is appropriate for testing only. If ZooKeeper is run outside of HBase, the process is called QuorumPeer.

HBase Web UI

If everything is set up correctly, you should be able to connect to the UI for the Master at http://node-a.corejavaguru.com:16010/ using a web browser. You can see the web UI for each of the RegionServers at port 16030 of their IP addresses, or by clicking their links in the web UI for the Master.

Note: In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer.
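
The port rules above are easy to mix up, so here is a small sketch that builds the UI address for a host. hbase_ui_url is a hypothetical helper, and the optional third argument "old" selects the port numbers used by 0.98.x and earlier.

```shell
#!/bin/sh
# Build the web UI URL for a host.
# $1: hostname, $2: "master" or "regionserver",
# $3: "old" for the 0.98.x and earlier port numbers (optional).
# hbase_ui_url is a hypothetical helper name used only in this sketch.
hbase_ui_url() {
    case "$2:${3:-new}" in
        master:old)       port=60010 ;;
        regionserver:old) port=60030 ;;
        master:*)         port=16010 ;;
        regionserver:*)   port=16030 ;;
        *) echo "unknown role: $2" >&2; return 1 ;;
    esac
    echo "http://$1:$port/"
}

url=$(hbase_ui_url node-a.corejavaguru.com master)
echo "$url"

# On a machine that can reach the cluster, curl can confirm that the UI answers:
# curl -s -o /dev/null -w '%{http_code}\n' "$url"
```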