BIG
DATA

JAVA

Zookeeper Leader Election

Read more about »
  • Java 9 features
  • Read about Hadoop
  • Read about Storm
  • Read about Storm
 

How is a leader elected in Apache ZooKeeper?

The leader is a server that has been chosen by an ensemble of servers and that continues to have support from that ensemble. The purpose of the leader is to order client requests that change the ZooKeeper state: create, setData, and delete. The leader transforms each request into a transaction, and proposes to the followers that the ensemble accepts and applies them in the order issued by the leader.

When a process starts, it enters the ELECTION state. While in this state the process tries to elect a new leader or become a leader. If the process finds an elected leader, it moves to the FOLLOWING state and begins to follow the leader. Processes in the FOLLOWING state are followers. If the process is elected leader, it moves to the LEADING state and becomes the leader. Given that a process that leads also follows, states LEADING and FOLLOWING are not exclusive. A follower transitions to ELECTION if it detects that the leader has failed or relinquished leadership, while a leader transitions to ELECTION once it observes that it no longer has a quorum of followers supporting its leadership.

ZooKeeper Leader Election

Establishing a new leader

Each server starts in the LOOKING state, where it must either elect a new leader or find the existing one. If a leader already exists, other servers inform the new one which server is the leader. At this point, the new server connects to the leader and makes sure that its own state is consistent with the state of the leader. If an ensemble of servers, however, are all in the LOOKING state, they must communicate to elect a leader. They exchange messages to converge on a common choice for the leader. The server that wins this election enters the LEADING state, while the other servers in the ensemble enter the FOLLOWING state. The leader election messages are called leader election notifications, or simply notifications.

The protocol is extremely simple. When a server enters the LOOKING state, it sends a batch of notification messages, one to each of the other servers in the ensemble. The message contains its current vote, which consists of the server’s identifier (sid) and the zxid (zxid) of the most recent transaction it executed.

Upon receiving a vote, a server changes its vote according to the following rules:

  • Let voteId and voteZxid be the identifier and the zxid in the current vote of the receiver, whereas myZxid and mySid are the values of the receiver itself.
  • If (voteZxid > myZxid) or (voteZxid = myZxid and voteId > mySid), keep the current vote.
  • Otherwise, change my vote by assigning myZxid to voteZxid and mySid to vote Zxid.

In short, the server that is most up to date wins, because it has the most recent zxid. This simplifies the process of restarting a quorum when a leader dies. If multiple servers have the most recent zxid, the one with the highest sid wins. Once a server receives the same vote from a quorum of servers, the server declares the leader elected. If the elected leader is the server itself, it starts executing the leader role. Otherwise, it becomes a follower and tries to connect to the elected leader.

Leading

A leader proposes operations by queuing them to all connected followers.If a connection to a given follower closes, then the proposals queued to the connection are discarded and the leader considers the corresponding follower down. To mutually detect crashes in a fine-grained and convenient manner, avoiding operating system reconfiguration, leader and followers exchange periodic heartbeats. If the leader does not receive heartbeats from a quorum of followers within a timeout interval, the leader renounces leadership of the epoch, and transitions to the ELECTION state and the whole process starts all over again.

Leader Protocol in a nutshell
  • At startup wait for a quorum of followers to connect
  • Sync with a quorum of followers
    • Tell the follower to delete any transaction that the leader doesn't have
    • Send any transactions that the follower doesn't have
  • Continually
    • Assign an zxid to any message to be proposed and broadcast proposals to followers
    • When a quorum has acked a proposal, broadcast a commit

Following

When a follower emerges from leader election, it connects to the leader. To support a leader, a follower acknowledges its new epoch proposal, and it only does so if the new epoch proposed is later than its own. A follower only follows one leader at a time and stays connected to a leader as long as it receives heartbeats within a timeout interval. If there is an interval with no heartbeat or the TCP connection closes, the follower abandons the leader, transitions to ELECTION and proceeds to elections Phase to start all over again.

Follower protocol in a nutshell
  • Connect to a leader
  • Delete any transactions in the txn log that the leader says to delete
  • Continually
    • Log to the transactions log and send an ack to leader
    • Deliver any committed transactions

Liveness

To sustain leadership, a leader process needs to be able to send messages to and receive messages from followers. In fact, leader process requires that a quorum of followers are up and select it as their leader to maintain its leadership.