BIG
DATA

JAVA

What is Zookeeper

Read more about »
  • Java 9 features
  • Read about Hadoop
  • Read about Storm
  • Read about Storm
 
Apache Zookeeper logo
Written in Java
Operating system Cross-platform
Type Distributed computing
License Apache License 2.0
Website zookeeper.apache.org

ZooKeeper is a distributed, open-source coordination service for distributed applications. It is also called as 'King of Coordination'. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications the responsibility of implementing coordination services from scratch.

Why Apache Zookeeper

In the good old past, each application software was a single program running on a single computer with a single CPU. Today, things have changed. In the Big Data world, application softwares are made up of many independent programs running on an ever-changing set of computers. These applications are known as Distributed Application. A distributed application can run on multiple systems in a network simultaneously by coordinating among themselves to complete a particular task in a fast and efficient manner.

Building distributed systems is hard. Nowadays, a lot of the software applications people use daily, however, depend on such systems, and it doesn’t look like we will stop relying on distributed computer systems any time soon. Coordinating the actions of the independent programs in a distributed systems is far more difficult than writing a single program to run on a single computer. It is easy for developers to get mired in coordination logic and lack the time to write their application logic properly or perhaps the converse, to spend little time with the coordination logic and simply to write a quick-and-dirty master coordinator that is fragile and becomes an unreliable single point of failure.

ZooKeeper was designed to be a robust service that enables application developers to focus mainly on their application logic rather than coordination. It exposes a simple API, inspired by the filesystem API, that allows developers to implement common coordination tasks, such as electing a master server, managing group membership, and managing metadata. ZooKeeper is an application library with two principal implementations of the APIs—Java and C—and a service component implemented in Java that runs on an ensemble of dedicated servers.

When designing an application with ZooKeeper, one ideally separates application data from control or coordination data.

ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs.

What is distributed system?

A distributed system is a collection of independent computers that appear to the users of the system as a single computer. A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.

What is Distributed Computing?

A distributed application is software that is executed or run on multiple computers within a network and can be stored on servers or with cloud computing. These applications interact in order to achieve a specific goal or task. Traditional applications relied on a single system to run them. Unlike traditional applications that run on a single system, distributed applications run on multiple systems simultaneously for a single task or job.

The word distributed means data being spread out over more than one computer in a network. Distributed applications are broken up into two separate programs: the client software and the server software. The client accesses the data from the server, while the server processes the data. Cloud computing can be used instead of servers or hardware to process a distributed application's data. With distributed applications, if a node that is running a particular application goes down, another node can resume the task.

Normally, huge, complex and time-consuming tasks, which will take hours to complete by an application running in a single system can be done in minutes by a distributed application by using computing capabilities of all the system involved within the dedicated distributed system. Time can be further reduced by configuring the distributed application to run on more systems. In a distributed application architecture a group of systems in which a distributed application is running is called a Cluster and each machine in a cluster is called a Node.

An distributed application architecture has two parts, Server and Client application. Server applications are the ones which is distributed and have a common interface so that clients can connect to any server in the cluster and get the prcess done.

Advantages and Disadvantages of a Distributed Application

Running a application on multiple computers within a network has its share of advantages and disadvantages. Lets see them:

Advantages of a Distributed Application

  • Scalability- The load an application can sustain can be increased by placing extra server processes in a group; adding machines to the application and redistributing the groups across the machines; replicating a group onto other machines within the application and using load balancing; segmenting a database and using data-dependent routing to reach the groups dealing with these separate database segments.
  • Ease of development/maintainability- The separation of the business application logic into services or components that communicate through well-defined messages or interfaces allows both development and maintenance to be similarly separated and so simplified.
  • Resilience- When multiple machines are in use and one fails, the remainder can continue operation. Similarly, when multiple server processes are within a group and one fails, the others are present to perform work. Finally, if a machine should break, but there are multiple machines within the application, these other machines can be used to perform the work of the application.
  • Coordination of autonomous actions- If you have separate applications, you can coordinate autonomous actions among the applications. You can coordinate autonomous actions as a single logical unit of work. Autonomous actions are actions that involve multiple server groups and/or multiple resource manager interfaces.

Disadvantages of a Distributed Application

  • Complexities- Typically, distributed systems are Typically, distributed systems are more complex than centralised systems
  • Network reliance- Messages can be lost in the communication network, which requires special software to be able to recover, and it can become overloaded.
  • Security- Distributed systems are more susceptible to external attack.
  • Multiple point of failure- Due to the huge no of nodes involved there could be more than one point of failure.
  • Manageability- More effort is required for system management.
    • Race condition: Two or more processes trying to perform a particular task, which actually needs to be done by one.
    • Deadlock: Two or more operations waiting for each other to complete indefinitely.
    • Unpredictability: Unpredictable responses depending depending on the system organisation and network load.

Despite these potential problems, many people feel that the advantages outweigh the disadvantages, and it is expected that distributed systems will become increasingly important in the coming years. In fact, it is likely that within a few years, most organizations will connect most of their computers into large distributed systems to provide better, cheaper, and more convenient service for the users.

The Origin of the Name “ZooKeeper”

ZooKeeper was developed at Yahoo! Research. Yahoo had been working on ZooKeeper for a while and pitching it to other groups. At the time the ZooKeeper group had been working with the Hadoop team and had started a variety of projects with the names of animals, Apache Pig being the most well known. As the group started talking about different possible names, one of the group members mentioned that they should avoid another animal name because it started to sound like a zoo. That is when it clicked: distributed systems are a zoo. They are chaotic and hard to manage, and ZooKeeper is meant to keep them under control.

Projects which uses ZooKeeper

  • Apache BookKeeper- BookKeeper(ZooKeeper subproject) is a replicated service to reliably log streams of records.
  • Apache Hadoop MapReduce- The next generation of Hadoop MapReduce (colled "Yarn") uses ZooKeeper.
  • Apache HBase- HBase is the Hadoop database. Its an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable. HBase uses ZooKeeper for master election, server lease management, bootstrapping, and coordination between servers.
  • Apache Kafka- Kafka is a distributed publish/subscribe messaging system. Kafka queue consumers uses Zookeeper to store information on what has been consumed from the queue.
  • Apache Storm- Storm uses Zookeeper to store all state so that it can recover from an outage in any of its (distributed) component services.

World without ZooKeeper

When designing a distributed system, there is typically a need for designing and developing some coordination services:

  • Name service— A naming service is a service that maps a name to some information associated with that name. A telephone directory is a name service that maps the name of a person to his/her telephone number. In the same way, a DNS service is a name service that maps a domain name to an IP address. In your distributed system, you may want to keep a track of which servers or services are up and running and look up their status by name.
  • Locking— To allow for serialized access to a shared resource in your distributed system, you may need to implement distributed mutexes.
  • Synchronization— Hand in hand with distributed mutexes is the need for synchronizing access to shared resources. Whether implementing a producer-consumer queue or a barrier.
  • Configuration management— The configuration of your distributed system must centrally stored and managed.This means that any new nodes joining should pick up the up-to-date centralized configuration as soon as they join the system.
  • Leader election— Your distributed system may have to deal with the problem of nodes going down, and you may want to implement an automatic fail-over strategy. You can do this by leader election.

Previous systems in a distributed sytems have implemented components like distributed lock managers or have used distributed databases for coordination. While it's possible to design and implement all of these services from scratch, it's extra work and difficult to debug any problems, race conditions, or deadlocks. Just like you don't go around writing your own hashing function in your code, there was a need that people shouldn't go around writing their own name services or leader election services from scratch every time they need it. Moreover, you could hack together a very simple group membership service relatively easily, but it would require much more work to write it to provide reliability, replication, and scalability. This led to the development and open sourcing of Apache ZooKeeper, an out-of-the box reliable, scalable, and high-performance coordination service for distributed systems.

ZooKeeper, in fact, borrows a number of concepts from these prior systems. It does not expose a lock interface or a general purpose interface for storing data, however. The design of ZooKeeper is specialized and very focused on coordination tasks. It is certainly possible to build distributed systems without using ZooKeeper. ZooKeeper, however, offers developers the possibility of focusing more on application logic rather than on arcane distributed systems concepts. Programming distributed systems without ZooKeeper is possible, but more difficult.

Why is Distributed Systems Coordination Hard?

When an application starts up, all of the different processes needs to find the application configuration. Over time this configuration may change. We could shut everything down, redistribute configuration files, and restart, but that may incur extended periods of application downtime during reconfiguration. Also as the load changes, we want to be able to add or remove new machines and processes.

The problems described above are functional problems that you can design solutions for and you can test your solutions before deployment. But the truly difficult problems encounter, when the distributed applications have to do with faults specifically, crashes and communication faults. These failures can crop up at any point, and it may be impossible to enumerate all the different cases that need to be handled.

One of the diferences between single machine and distributed applications is: When a single machine crashes, all the processes running on that machine fail. If there are multiple processes running on the machine and a process fails, the other processes can find out about the failure from the operating system. The operating system can also provide strong messaging guarantees between processes. All of this changes in a distributed environment: if a machine or process fails, other machines will keep running and may need to take over for the faulty processes. To handle faulty processes, the processes that are still running must be able to detect the failure; messages may be lost, and there may even be clock drift.

Okay, so we cannot have an ideal fault-tolerant, distributed, real-world system that transparently takes care of all problems that might ever occur. We can strive for a slightly less ambitious goal, though.

Having pointed out that the perfect solution is impossible, we can repeat that ZooKeeper is not going to solve all the problems that the distributed application developer has to face. It does give the developer a nice framework to deal with these problems, though.

What does a ZooKeeper do?

ZooKeeper is itself a distributed application providing services for developing a distributed application. It coordinates a group of nodes within the cluster and maintains shared data with effective synchronization techniques. Some of the services provided by zookeeper are:

  • ZooKeeper exposes a simple interface for Naming service which identifies the nodes in a cluster by name simialr to DNS.
  • ZooKeeper provides for an easy way for you to implement distributed mutexes to allow for serialized access to a shared resource in your distributed system.
  • You can use ZooKeeper to centrally store and manage the configuration of your distributed system. This means that any new nodes joining will pick up the up-to-date centralized configuration from ZooKeeper as soon as they join the system. This also allows you to centrally change the state of your distributed system by changing the centralized configuration through one of the ZooKeeper clients.
  • ZooKeeper provides off-the-shelf support for leader election which will deal with the problem of nodes going down.

From the overview chapter we have understood the requirements of distributed applications at a high level. In the next chapters we will learn about zookeeper.