Apache HBase Configurations

Configuration Files

All configuration files are located in the conf/ directory, which must be kept in sync across every node in your cluster. HBase uses the same configuration system as Apache Hadoop. The HBase configuration files are:

  • backup-masters
    Not present by default. A plain-text file which lists hosts on which the Master should start a backup Master process, one host per line.
  • hadoop-metrics2-hbase.properties
    Used to connect HBase to Hadoop’s Metrics2 framework. See the Hadoop Wiki entry for more information on Metrics2. Contains only commented-out examples by default.
  • hbase-env.cmd and hbase-env.sh
    Scripts for Windows and Linux/Unix environments, respectively, that set up the working environment for HBase, including the location of Java, Java options, and other environment variables. The files contain many commented-out examples to provide guidance.
  • hbase-policy.xml
    The default policy configuration file used by RPC servers to make authorization decisions on client requests. Only used if HBase security is enabled.
  • hbase-site.xml
    The main HBase configuration file. This file specifies configuration options which override HBase’s default configuration. You can view (but do not edit) the default configuration file at docs/hbase-default.xml. You can also view the entire effective configuration for your cluster (defaults and overrides) in the HBase Configuration tab of the HBase Web UI.
  • log4j.properties
    Configuration file for HBase logging via log4j.
  • regionservers
    A plain-text file containing a list of hosts which should run a RegionServer in your HBase cluster. By default this file contains the single entry localhost. It should contain a list of hostnames or IP addresses, one per line, and should only contain localhost if each node in your cluster will run a RegionServer on its localhost interface.

Configurations

hbase-site.xml

This file specifies configuration options which override HBase’s default configuration, found in docs/hbase-default.xml. Site-specific customizations for HBase go into conf/hbase-site.xml. Changes to this file require a cluster restart before HBase notices them. Below are some of the important configurations you may add to this file:

  • hbase.tmp.dir
    Temporary directory on the local filesystem. Change this setting to point to a location more permanent than '/tmp', the usual value of java.io.tmpdir, as the '/tmp' directory is cleared on machine restart.
  • hbase.rootdir
    The directory shared by region servers and into which HBase persists. The URL should be 'fully-qualified' to include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the HDFS instance’s namenode is running at namenode.example.org on port 9000, set this value to: hdfs://namenode.example.org:9000/hbase. By default, HBase writes to whatever ${hbase.tmp.dir} is set to (usually /tmp), so change this configuration or else all data will be lost on machine restart.
  • hbase.regionserver.port
    The port the HBase RegionServer binds to.
  • hbase.regionserver.info.port
    The port for the HBase RegionServer web UI. Set to -1 if you do not want the RegionServer UI to run.
    Default:16030
  • hbase.regionserver.handler.count
    Count of RPC Listener instances spun up on RegionServers. Same property is used by the Master for count of master handlers.
    Default:30
  • hbase.regionserver.logroll.period
    Period, in milliseconds, at which the commit log is rolled regardless of how many edits it has.
    Default:3600000
  • hbase.regionserver.global.memstore.size
    Maximum size of all memstores in a region server before new updates are blocked and flushes are forced. Defaults to 40% of heap (0.4). Updates are blocked and flushes are forced until size of all memstores in a region server hits hbase.regionserver.global.memstore.size.lower.limit. The default value in this configuration has been intentionally left empty in order to honor the old hbase.regionserver.global.memstore.upperLimit property if present.
  • hbase.regionserver.region.split.policy
    A split policy determines when a region should be split. Besides the default policy, the split policies currently available are ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy, DelimitedKeyPrefixRegionSplitPolicy, and KeyPrefixRegionSplitPolicy. DisabledRegionSplitPolicy blocks manual region splitting.
  • hbase.regionserver.regionSplitLimit
    Limit for the number of regions after which no more region splitting should take place. This is not a hard limit on the number of regions but acts as a guideline for the RegionServer to stop splitting once the limit is reached.
    Default:1000
  • hbase.cluster.distributed
    The mode the cluster will be in. Possible values are false for standalone mode and true for distributed mode. If false, startup runs all HBase and ZooKeeper daemons together in one JVM. If true, HBase runs in distributed mode, with one JVM instance per daemon.
    Default:false
  • dfs.replication
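    An HDFS client setting. HBase does not see your HDFS client configuration unless you make it available to HBase. If, for example, you want to run with a replication factor of 5, HBase will still create files with the default of 3 unless you set dfs.replication in hbase-site.xml or otherwise make the HDFS client configuration available to HBase.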
  • hbase.master.port
    The port the HBase Master should bind to.
    Default:16000
  • hbase.master.info.port
    The port for the HBase Master web UI. Set to -1 if you do not want a UI instance run.
    Default:16010
  • zookeeper.session.timeout
    ZooKeeper session timeout in milliseconds. It is used in two different ways. First, this value is used in the ZK client that HBase uses to connect to the ensemble. Second, it is used by HBase when it starts a ZK server, where it is passed as the 'maxSessionTimeout'. See http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions. For example, if an HBase region server connects to a ZK ensemble that is also managed by HBase, then the session timeout will be the one specified by this configuration. A region server that connects to an ensemble managed with a different configuration, however, is subject to that ensemble’s maxSessionTimeout. So, even though HBase might propose using 90 seconds, the ensemble can have a lower maximum timeout, and that lower value takes precedence. The current default that ZK ships with is 40 seconds, which is lower than HBase’s.
    Default:90000
  • hbase.zookeeper.quorum
    Comma-separated list of servers in the ZooKeeper ensemble (this config should have been named hbase.zookeeper.ensemble). For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". By default this is set to localhost for local and pseudo-distributed modes of operation. For a fully-distributed setup, this should be set to the full list of ZooKeeper ensemble servers. If HBASE_MANAGES_ZK is set in hbase-env.sh, this is the list of servers on which HBase will start and stop ZooKeeper as part of cluster start/stop. On the client side, this list of ensemble members is combined with the hbase.zookeeper.property.clientPort config and passed to the ZooKeeper constructor as the connectString parameter.
    Default:localhost
  • hbase.zookeeper.property.clientPort
    Property from ZooKeeper’s config zoo.cfg. The port at which the clients will connect.
    Default:2181
  • hbase.zookeeper.property.maxClientCnxns
    Property from ZooKeeper’s config zoo.cfg. Limit on number of concurrent connections (at the socket level) that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble. Set high to avoid zk connection issues running standalone and pseudo-distributed.
    Default:300
  • hbase.client.write.buffer
    Default size of the HTable client write buffer in bytes. A bigger buffer takes more memory on both the client and the server side, since the server instantiates the passed write buffer to process it, but a larger buffer size reduces the number of RPCs made. For an estimate of server-side memory used, evaluate hbase.client.write.buffer * hbase.regionserver.handler.count.
    Default:2097152
  • hbase.client.retries.number
    Maximum retries. Used as the maximum for all retryable operations, such as getting a cell’s value or starting a row update. The retry interval is a rough function based on hbase.client.pause. At first we retry at this interval, but with backoff we quickly reach retrying every ten seconds. See HConstants#RETRY_BACKOFF for how the backoff ramps up. Change this setting and hbase.client.pause to suit your workload.
    Default:35
  • hbase.client.scanner.caching
    Number of rows that we try to fetch when calling next on a scanner if it is not served from (local, client) memory. This configuration works together with hbase.client.scanner.max.result.size to try to use the network efficiently. The default value is Integer.MAX_VALUE, so that the network fills the chunk size defined by hbase.client.scanner.max.result.size rather than being limited by a particular number of rows, since the size of rows varies from table to table. If you know ahead of time that you will not require more than a certain number of rows from a scan, set this configuration to that row limit via Scan#setCaching. Higher caching values enable faster scanners but use more memory, and some calls of next may take progressively longer when the cache is empty. Do not set this value such that the time between invocations is greater than the scanner timeout, i.e. hbase.client.scanner.timeout.period.
    Default:2147483647
  • hbase.client.scanner.timeout.period
    Client scanner lease period in milliseconds.
    Default:60000
  • hbase.hregion.memstore.flush.size
    Memstore will be flushed to disk if size of the memstore exceeds this number of bytes. Value is checked by a thread that runs every hbase.server.thread.wakefrequency.
    Default:134217728
  • hbase.hregion.max.filesize
    Maximum HFile size. If the sum of the sizes of a region’s HFiles has grown to exceed this value, the region is split in two.
    Default:10737418240
  • hbase.hregion.majorcompaction
    Time between major compactions, expressed in milliseconds. Set to 0 to disable time-based automatic major compactions. User-requested and size-based major compactions will still run. This value is multiplied by hbase.hregion.majorcompaction.jitter to cause compaction to start at a somewhat-random time during a given window of time. The default value is 7 days, expressed in milliseconds. If major compactions are causing disruption in your environment, you can configure them to run at off-peak times for your deployment, or disable time-based major compactions by setting this parameter to 0, and run major compactions in a cron job or by another external mechanism.
    Default:604800000
  • hfile.block.cache.size
    Percentage of maximum heap (-Xmx setting) to allocate to block cache used by a StoreFile. Default of 0.4 means allocate 40%. Set to 0 to disable but it’s not recommended; you need at least enough cache to hold the storefile indices.
    Default:0.4
  • hbase.rpc.timeout
    This is for the RPC layer to define how long HBase client applications take for a remote call to time out. It uses pings to check connections but will eventually throw a TimeoutException.
    Default:60000
  • hbase.table.max.rowsize
    Maximum size of a single row in bytes (default is 1 GB) for Get or Scan operations without the in-row scan flag set. If the row size exceeds this limit, RowTooBigException is thrown to the client.
    Default:1073741824
  • hbase.snapshot.enabled
    Set to true to allow snapshots to be taken / restored / cloned.
    Default:true
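
As a rough illustration of how the properties above are overridden, the fragment below sketches a conf/hbase-site.xml that moves the temporary directory off /tmp, points hbase.rootdir at HDFS, enables distributed mode, and adjusts two tuning values. The path /var/lib/hbase/tmp and the handler count of 60 are assumptions chosen purely for illustration; the namenode URL and ZooKeeper host names are the examples already used in the descriptions above. Treat it as a sketch to adapt, not a recommended configuration.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed path: any local directory that survives a reboot will do. -->
  <property>
    <name>hbase.tmp.dir</name>
    <value>/var/lib/hbase/tmp</value>
  </property>
  <!-- Fully-qualified URL, including the filesystem scheme. -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:9000/hbase</value>
  </property>
  <!-- true: distributed mode, one JVM instance per daemon. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Comma-separated list of ZooKeeper ensemble members. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>host1.mydomain.com,host2.mydomain.com,host3.mydomain.com</value>
  </property>
  <!-- Illustrative value only; the default is 30. -->
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>60</value>
  </property>
  <!-- 0 disables time-based major compactions; run them externally instead. -->
  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>0</value>
  </property>
</configuration>
```

With the default write buffer of 2097152 bytes, the server-side estimate hbase.client.write.buffer * hbase.regionserver.handler.count for this sketch works out to 2097152 * 60 = 125829120 bytes (120 MB), compared with 62914560 bytes (60 MB) at the default handler count of 30.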

hbase-env.sh

Set HBase environment variables in this file. Examples include options to pass to the JVM when an HBase daemon starts, such as heap size and garbage collector settings. You can also set environment variables such as JAVA_HOME and HADOOP_HOME in this file. Changes here require a cluster restart for HBase to notice the change.

log4j.properties

Configuration file for HBase logging via log4j. Edit this file to change the rate at which HBase log files are rolled and the level at which HBase logs messages. Changes here require a cluster restart for HBase to notice the change.

Example Configurations

Here is an example of basic configuration for a distributed ten node cluster:

  • The nodes are named node0, node1, etc., through node9 in this example.
  • The HBase Master and the HDFS NameNode are running on the node node0.
  • RegionServers run on nodes node1-node9.
  • A 3-node ZooKeeper ensemble runs on node1, node2, and node3 on the default ports.
  • ZooKeeper data is persisted to the directory /opt/zookeeper/zk_data.

The main configuration files (hbase-site.xml, regionservers, and hbase-env.sh), found in the HBase conf directory, might look like this:

hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
    <description>Comma-separated list of servers in the ZooKeeper ensemble.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/zookeeper/zk_data</value>
    <description>Property from ZooKeeper config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node0:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>
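
The example relies on the ZooKeeper ensemble using its default ports. If you would rather state the client port explicitly, a property along the following lines could be added inside the <configuration> element. The value 2181 is the default listed earlier for hbase.zookeeper.property.clientPort, so this sketch changes no behaviour; it only documents the assumption.

```xml
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
  <description>Property from ZooKeeper config zoo.cfg.
  The port at which the clients will connect.
  </description>
</property>
```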

regionservers

In this file you list the nodes that will run RegionServers. In our case, these nodes are node1 through node9.

node1
node2
node3
node4
node5
node6
node7
node8
node9

hbase-env.sh

The following lines in the hbase-env.sh file show how to set the JAVA_HOME environment variable (required for HBase 0.98.5 and newer) and set the heap to 4 GB (rather than the default value of 1 GB).

# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.7.0/

# The maximum amount of heap to use. Default is left to JVM default.
export HBASE_HEAPSIZE=4G