Zookeeper Internals - Sessions, Requests, Zab, Snapshots

Read more about »
  • Java 9 features
  • Read about Hadoop
  • Read about Storm
  • Read about Storm


Before executing any request to a ZooKeeper ensemble, a client must establish a session with the service.Sessions are very important for the operation of ZooKeeper. All operations a client submits to ZooKeeper are associated to a session. When a session ends for any reason, the ephemeral nodes created during that session disappear.

When a client creates a ZooKeeper handle, it establishes a session with the service. The client initially connects to any server in the ensemble, and only to a single server. It uses a TCP connection to communicate with the server, but the session may be moved to a different server if the client has not heard from its current server for some time. Moving a session to a different server is handled transparently by the ZooKeeper client library.

Sessions offer order guarantees

Requests in a session are executed in FIFO(first in, first out) order. Typically, a client has only a single session open, so its requests are all executed in FIFO order. If a client has multiple concurrent sessions, FIFO ordering is not necessarily preserved across the sessions. Once a client connects to a server, the session will be established and a session id is assigned to the client. The client sends heartbeats to keep the session valid. If the ZooKeeper ensemble does not receive heartbeats from a client for more than the period agreed at the starting of the service, zooKeeper decides that the client is dead. When this happens session is ended and followed by deletion of ephemeral znodes created during that session.

States and the Lifetime of a Session

The lifetime of a session is the period between its creation and its end, whether it is closed gracefully or expires because of a timeout. The possible states of a session are : CONNECTING, CONNECTED, CLOSED, and NOT_CONNECTED.

ZooKeeper Session States

A session starts at the NOT_CONNECTED state and transitions to CONNECTING (arrow 1) with the initialization of the ZooKeeper client. Normally, the connection to a ZooKeeper server succeeds and the session transitions to CONNECTED (arrow 2). When the client loses its connection to the ZooKeeper server or doesn’t hear from the server, it transitions back to CONNECTING (arrow 3) and tries to find another ZooKeeper server. If it is able to find another server or to reconnect to the original server, it transitions back to CONNECTED once the server confirms that the session is still valid. Otherwise, it declares the session expired and transitions to CLOSED (arrow 4). The application can also explicitly close the session (arrows 4 and 5).

One important parameter you should set when creating a session is the session timeout, which is the amount of time the ZooKeeper service allows a session before declaring it expired. If the service does not see messages associated to a given session during time t, it declares the session expired. On the client side, if it has heard nothing from the server at 1/3 of t, it sends a heartbeat message to the server. At 2/3 of t, the ZooKeeper client starts looking for a different server, and it has another 1/3 of t to find one.

When trying to connect to a different server, it is important for the ZooKeeper state of this server to be at least as fresh as the last ZooKeeper state the client has observed. A client cannot connect to a server that has not seen an update that the client might have seen. ZooKeeper determines freshness by ordering updates in the service.

ZooKeeper Requests and Transactions

ZooKeeper servers process read requests (exists, getData, and getChildren) locally. When a server receives, say, a getData request from a client, it reads its state and returns it to the client. Because it serves requests locally, ZooKeeper is pretty fast at serving readdominated workloads. We can add more servers to the ZooKeeper ensemble to serve more read requests, increasing overall throughput capacity. The leader executes the request, producing a state update that we call a transaction. Whereas the request expresses the operation the way the client originates it, the transaction comprises the steps taken to modify the ZooKeeper state to reflect the execution of the request.

Let’s Say that a client submits a setData request on a given znode /z. setData should change the data of the znode and bump up the version number. So, a transaction for this request contains two important fields: the new data of the znode and the new version number of the znode. When applying the transaction, a server simply replaces the data of /z with the data in the transaction and the version number with the value in the transaction, rather than bumping it up.

A transaction is treated as a unit, in the sense that all changes it contains must be applied atomically. In the setData example, changing the data without an accompanying change to the version accordingly leads to trouble. Consequently, when a ZooKeeper ensemble applies transactions, it makes sure that all changes are applied atomically and there is no interference from other transactions. A transaction is also idempotent. That is, we can apply the same transaction twice and we will get the same result.

When the leader generates a new transaction, it assigns to the transaction an identifier that we call a ZooKeeper transaction ID (zxid). Zxids identify transactions so that they are applied to the state of servers in the order established by the leader. Servers also exchange zxids when electing a new leader, so they can determine which nonfaulty server has received more transactions and can synchronize their states. A zxid is a long (64-bit) integer split into two parts: the epoch and the counter. Each part has 32 bits.

Zab - ZooKeeper Atomic Broadcast protocol

When the client sends a write request, a follower forwards it to the leader. The leader executes the request and broadcasts the result of the execution as a state update, in the form of a transaction. A transaction comprises the exact set of changes that a server must apply to the data tree when the transaction is committed. The data tree is the data structure holding the ZooKeeper state. The server determines that a transaction has been committed by following a protocol called Zab-the ZooKeeper Atomic Broadcast protocol.

Assuming that there is an active leader and it has a quorum of followers supporting its leadership, the protocol to commit a transaction is very simple:

  • The leader sends a PROPOSAL message, p, to all followers.
  • Upon receiving p, a follower responds to the leader with an ACK, informing the leader that it has accepted the proposal.
  • Upon receiving acknowledgments from a quorum (the quorum includes the leader itself), the leader sends a message informing the followers to COMMIT it.
ZAB-ZooKeeper Atomic Broadcast protocol

Before acknowledging a proposal, the follower needs to perform a couple of additional checks. The follower needs to check that the proposal is from the leader it is currently following, and that it is acknowledging proposals and committing transactions in the same order that the leader broadcasts them in.
Zab guarantees a couple of important properties:

  • If the leader broadcasts T and Tʹ in that order, each server must commit T before committing Tʹ .
  • If any server commits transactions T and Tʹ in that order, all other servers must also commit T before Tʹ .

The first property guarantees that transactions are delivered in the same order across servers, whereas the second property guarantees that servers do not skip transactions. Given that the transactions are state updates and each state update depends upon the previous state update, skipping transactions could create inconsistencies. The twophase commit guarantees the ordering of transactions.

Zookeeper Snapshots

Snapshots are copies of the ZooKeeper data tree. Each server frequently takes a snapshot of the data tree by serializing the whole data tree and writing it to a file. The servers do not need to coordinate to take snapshots, nor do they have to stop processing requests. Because servers keep executing requests while taking a snapshot, the data tree changes as the snapshot is taken. We call such snapshots fuzzy, because they do not necessarily reflect the exact state of the data tree at any particular point in time. Let’s say that a data tree has only two znodes: /z and /z'. Initially, the data of both /z and /z' is the integer 1. Now consider the following sequence of steps:

  • Start a snapshot.
  • Serialize and write /z = 1 to the snapshot.
  • Set the data of /z to 2 (transaction T).
  • Set the data of /z' to 2 (transaction Tʹ ).
  • Serialize and write /z' = 2 to the snapshot.

This snapshot contains /z = 1 and /z' = 2. However, there has never been a point in time in which the values of both znodes were like that. This is not a problem, though, because the server replays transactions. It tags each snapshot with the last transaction that has been committed when the snapshot starts—call it TS. If the server eventually loads the snapshot, it replays all transactions in the transaction log that come after TS. In this case, they are T and Tʹ . After replaying T and Tʹ on top of the snapshot, the server obtains /z = 2 and /z' = 2, which is a valid state. There is no problem with applying Tʹ again because, transactions are idempotent. So as long as we apply the same transactions in the same order, we will get the same result even if some of them have already been applied to the snapshot.