ZooKeeper Data Model - Znodes

Read more about »
  • Java 9 features
  • Read about Hadoop
  • Read about Storm
  • Read about Storm


ZooKeeper architecture

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers. The namespace looks quite similar to a Unix filesystem. The data registers are known as znodes in the ZooKeeper nomenclature. Each znode can be fortified with a byte-array that stores actual data and each znode also holds some metadata information. Further, each znode may be linked to other lower znodes, forming an internal directory structure. Znodes maintain a stat structure that includes version numbers for data changes, Access Control List (ACL) changes, as well as a timestamp. The version number combined with the timestamp allows the ZooKeeper to validate the cache and to coordinate updates. The data that is stored at each znode in a namespace is atomically read and written.

The read requests obtain all the data bytes associated with a znode and a write request replaces all of the data. Each node is associated with an ACL that governs who can do what. From a data model perspective, at the top of the hierarchy is a root that is referred to as / (see Figure) and below the root are n znodes (ZooKeeper nodes). ZooKeeper stores the entire data tree in an in-memory database that is labeled the replicated database. Each znode in the tree stores a maximum of 1MB of data (by default). For recoverability purposes, ZooKeeper efficiently logs updates to disk and forces write operations to be committed to disk prior to applying the changes to the in-memory database.

A znode can be labeled as sequential, which implies that the name that the client provides when creating the znode reflects a prefix. The full name is given by a sequential number that is chosen by the ensemble. In other words, when creating a znode, a request can be made so that the ZooKeeper appends a monotonically increasing counter to the end of the path. So nodes that are created with the sequential flag set contain the value of a monotonically increasing counter that is appended to its name. If n reflects the new znode and p represents the parent znode, then the sequence value of n is never smaller than the value in the name of any other sequential znode ever created under p. This is useful for synchronization purposes where multiple clients request a lock on a resource, as they all can concurrently create a sequential znode on a location. In this scenario, whoever is assigned the lowest number is entitled to the lock.

A znode may also be labeled as ephemeral, which implies that the node gets obliterated as soon as the client that created the znode disconnects (see Figure 1). Such a znode can be used to determine when a client fails. Such a scenario is relevant as the client itself may have responsibilities that have to be taken over by another client. Another use-case revolves around locking where as soon as the client that held the lock disconnects, another client can determine the entitlement for the lock. To reiterate, with ephemeral znodes, when the session ends the znode is deleted. That is the reason why ephemeral znodes are not allowed to have children.

selecting a leader is straightforward with ZooKeeper as well, as the framework allows every server to publish its information in a znode that is both, sequential and ephemeral. So whichever server has the lowest sequential znode is labeled the leader. If the leader or any other server goes offline, its session terminates and its ephemeral node is removed. At that point, all the other server systems can observe who is elected the new leader (the node with the lowest sequential znode). The pattern formed by creating a sequential and ephemeral znode resembles a queue that is observable by all the server systems.

ZooKeeper architecture

With ZooKeeper, clients can set watches on znodes (to allow clients to receive timely notifications of actual changes without requiring a poll operation). When a watch triggers, ZooKeeper sends the client a notification. To reiterate, a watch may be set to trigger an event in scenarios where the znode is changed, removed, or new children are created underneath