How ZooKeeper Works

How Apache ZooKeeper Works Internally

ZooKeeper, although it is a coordination service for distributed systems, is a distributed application in its own right. ZooKeeper follows a simple client-server model where clients are nodes (i.e., machines) that make use of the service, and servers are nodes that provide the service. A collection of ZooKeeper servers forms a ZooKeeper ensemble. Once a ZooKeeper ensemble starts and the leader election process completes, it waits for clients to connect. At any given time, a ZooKeeper client is connected to exactly one ZooKeeper server, while each server can handle a large number of client connections simultaneously. Each client periodically sends pings to the server it is connected to, to let the server know that it is alive and connected; the server responds with an acknowledgment, indicating that it is alive as well. If the client does not receive an acknowledgment within the specified timeout, it connects to another server in the ensemble, and the client session is transparently transferred to the new ZooKeeper server.
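An ensemble is defined by a configuration file listing all of its members. As a sketch, a minimal zoo.cfg for a three-server ensemble might look like the following (the hostnames and dataDir are illustrative placeholders):

```ini
# conf/zoo.cfg -- hostnames and dataDir are illustrative placeholders
# base time unit in milliseconds, used for heartbeats and session timeouts
tickTime=2000
# ticks a follower may take to connect and sync to the leader
initLimit=10
# ticks a follower may lag behind the leader before being dropped
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.<id>=<host>:<quorum-port>:<leader-election-port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each server also needs a myid file in its dataDir containing its own server id (1, 2, or 3 here) so that it can identify itself within the ensemble.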

Apache ZooKeeper architecture

ZooKeeper has a file-system-like data model composed of znodes. Think of znodes (ZooKeeper data nodes) as files in a traditional UNIX-like system, except that they can have child nodes. Another way to look at them is as directories that can have data associated with them. Each of these directories is called a znode. The znode hierarchy is stored in memory within each ZooKeeper server, which allows for scalable and fast responses to client reads. Each ZooKeeper server also maintains a transaction log on disk, which records all write requests; a server must sync a transaction to disk before it returns a successful response. The default maximum size of the data that can be stored in a znode is 1 MB. ZooKeeper should therefore be used only to store the small amounts of data needed to provide reliability, availability, and coordination to your distributed application, not as a general-purpose data store.
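The znode hierarchy can be pictured as a tree where every node both holds data and can have children. The following is a toy in-memory model for intuition only, not ZooKeeper's actual server implementation:

```python
# Toy in-memory model of ZooKeeper's znode tree -- a sketch for intuition,
# not the real server implementation.

MAX_ZNODE_DATA = 1024 * 1024  # default 1 MB limit per znode


class Znode:
    def __init__(self, data=b""):
        self.data = data      # every znode can hold data...
        self.children = {}    # ...and have child znodes, like a directory


class ZnodeTree:
    def __init__(self):
        self.root = Znode()

    def _walk(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            if part:
                node = node.children[part]  # KeyError if the path is missing
        return node

    def create(self, path, data=b""):
        if len(data) > MAX_ZNODE_DATA:
            raise ValueError("znode data exceeds 1 MB limit")
        parent_path, _, name = path.rpartition("/")
        parent = self._walk(parent_path or "/")
        parent.children[name] = Znode(data)

    def get(self, path):
        return self._walk(path).data    # a read returns all the bytes

    def set(self, path, data):
        self._walk(path).data = data    # a write replaces all the bytes


tree = ZnodeTree()
tree.create("/app")
tree.create("/app/config", b"v1")
print(tree.get("/app/config"))  # b'v1'
```

In the real server the 1 MB ceiling corresponds to the jute.maxbuffer setting; the get/set semantics mirror the atomic whole-znode reads and writes described below.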

Data Access

The data stored at each znode in a namespace is read and written atomically. A read returns all the data bytes associated with a znode, and a write replaces all of that data. Each znode also has an Access Control List (ACL) that restricts who can do what.

ZooKeeper client Read

When a client requests the contents of a particular znode, the read takes place at the server the client is connected to: the client sends a read request containing the znode path, and that server returns the requested data from its own in-memory copy. Because only one server from the ensemble is involved, reads are fast and scale well; clients connected to different servers can all read simultaneously. The trade-off is eventual consistency: since the leader is not involved in reads, a client may see an outdated view of a znode, which is updated after a short delay once its server catches up with the latest writes.
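Why a local read can be stale is easiest to see with a small simulation. This is an illustrative model, not ZooKeeper's code: each server answers reads from its own copy, and a follower that has not yet applied the latest write returns the old value.

```python
# Sketch: reads are served locally by whichever server the client is
# connected to, so a lagging follower can return a stale value.
# Illustrative model only, not ZooKeeper's implementation.

class Server:
    def __init__(self, name):
        self.name = name
        self.store = {}    # this server's local copy of the znode data

    def read(self, path):
        return self.store.get(path)    # local read: fast, but possibly stale


# Three-server ensemble; a write has reached the leader and one follower,
# but has not yet been applied by the second follower.
leader, f1, f2 = Server("leader"), Server("f1"), Server("f2")
leader.store["/config"] = "v2"
f1.store["/config"] = "v2"
f2.store["/config"] = "v1"    # lagging follower

print(f1.read("/config"))  # v2 -- up to date
print(f2.read("/config"))  # v1 -- stale until replication catches up
```

For clients that need the latest view, the real ZooKeeper API provides a sync() call that forces the connected server to catch up with the leader before a subsequent read.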

ZooKeeper client Write

All writes in ZooKeeper go through the leader node, which guarantees that all writes are sequential. If a client wants to store data in the ZooKeeper ensemble, it sends the znode path and the data to the server it is connected to, and that server passes the request on to the leader. The leader then issues the same write request to all the nodes of the ensemble. If a strict majority of the nodes (known as a quorum) respond successfully to this write request, the write is considered to have succeeded, and a successful return code is sent back to the client that initiated it. For writes to complete, a strict majority of the ensemble's nodes must therefore be available; if a quorum of nodes is not available, the ZooKeeper service is nonfunctional.
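The write path above can be reduced to a toy model of the quorum-commit rule (this sketches the idea, not ZooKeeper's actual ZAB protocol):

```python
# Toy model of ZooKeeper's write path: the leader proposes a write to all
# servers, and the write commits only if a strict majority acknowledges.
# A sketch of the quorum-commit idea, not the real ZAB protocol.

class Ensemble:
    def __init__(self, size, servers_down=0):
        self.size = size
        self.up = size - servers_down

    def write(self, path, data):
        # Only live servers acknowledge the leader's proposal; a strict
        # majority of acks is required for the write to commit.
        acks = self.up
        quorum = self.size // 2 + 1
        return "OK" if acks >= quorum else "FAILED: no quorum"


print(Ensemble(5, servers_down=2).write("/config", b"v2"))  # OK (3 of 5 ack)
print(Ensemble(5, servers_down=3).write("/config", b"v2"))  # FAILED: no quorum
```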

In ZooKeeper, concurrent writes are not possible: every write is serialized through the leader. This linear-writes guarantee can become a bottleneck if ZooKeeper is used for a write-dominant workload. ZooKeeper is ideally used for coordinating message exchanges between clients, which involves fewer writes and more reads.

ZooKeeper Quorum

A quorum is represented by a strict majority of nodes in a ZooKeeper ensemble.

  • You can have one node in your ensemble, but it won't be a highly available or reliable system.
  • If you have two nodes in your ensemble, you would need both to be up to keep the service running because one out of two nodes is not a strict majority.
  • If you have three nodes in the ensemble, one can go down, and you would still have a functioning service (two out of three is a strict majority). For this reason, ZooKeeper ensembles typically contain an odd number of nodes.
  • Having four nodes gives you no benefit over having three with respect to fault tolerance. As soon as two nodes go down, your ZooKeeper service goes down, because the remaining two nodes do not make a strict majority.
  • On a five-node cluster, three nodes would need to go down before the ZooKeeper service stops functioning.
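The arithmetic behind these bullets is simple: for an ensemble of n servers, the quorum size is floor(n/2) + 1, and the ensemble tolerates n minus the quorum size failures.

```python
# Quorum size and fault tolerance for an ensemble of n servers.

def quorum(n):
    """Smallest strict majority of n servers."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many servers can fail while a quorum still exists."""
    return n - quorum(n)

for n in range(1, 8):
    print(f"{n} servers: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that an even-sized ensemble tolerates no more failures than the next smaller odd size (four servers tolerate one failure, just like three), which is why odd ensemble sizes are preferred.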

Reads are always served by the ZooKeeper server the client is connected to, so read performance does not change as the number of servers in the ensemble grows. Writes, however, succeed only once they reach a quorum of nodes, so write performance decreases as the ensemble grows, since each write has to be written to and coordinated among more servers.

The beauty of ZooKeeper is that it is up to you how many servers you want to run. If you would like to run one server, that's fine from ZooKeeper's perspective; you just won't have a highly reliable or available system. A three-node ZooKeeper ensemble will support one failure without loss of service, which is probably fine for most users and arguably the most common deployment topology. However, to be safe, use five nodes in your ensemble. A five-node ensemble allows you to take one server out for maintenance or rolling upgrade and still be able to withstand a second unexpected failure without interruption of service.

In short, three, five, or seven is the most typical number of nodes in a ZooKeeper ensemble. Keep in mind that the size of your ZooKeeper ensemble has little to do with the number of nodes in your distributed system: those nodes are clients of the ZooKeeper ensemble, and each ZooKeeper server can handle a large number of clients scalably.

NOTE: Because a quorum of servers is necessary for progress, ZooKeeper cannot make progress in the case that enough servers have permanently failed that no quorum can be formed. It is OK if servers are brought down and eventually boot up again, but for progress to be made, a quorum must eventually boot up.