2.23. Apache ZooKeeper

Apache ZooKeeper is a distributed service configuration repository. Java and C bindings are available as part of the project; multiple other bindings are provided by the community.

2.23.1. Architecture

ZooKeeper servers maintain a synchronized in-memory state backed by a persistent journal and snapshots. Clients specify a server list and connect to a single server at a time, with failover to another server from the list.

ZooKeeper Architecture

Servers.  Replicated server cluster

  • Each server stores complete state in memory

  • Updates are also stored in persistent log

  • Persistent snapshot done when updates accumulate
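
The interplay between the in-memory state, the log and the snapshots can be illustrated with a toy model. This is only a sketch: the class `JournaledStore`, the threshold `SNAPSHOT_EVERY` and the in-memory collections standing in for disk files are all assumptions made for illustration, not ZooKeeper's storage code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: state in memory, every update appended to a log,
// a snapshot taken once enough updates accumulate.
final class JournaledStore {
    static final int SNAPSHOT_EVERY = 3;   // hypothetical threshold

    final Map<String, String> memory = new HashMap<>();
    // Stand-ins for persistent storage (assumption: no real files here).
    Map<String, String> snapshot = new HashMap<>();
    final List<String[]> log = new ArrayList<>();

    void update(String key, String value) {
        log.add(new String[] { key, value });   // journal the update first
        memory.put(key, value);                 // then apply it in memory
        if (log.size() >= SNAPSHOT_EVERY) {
            snapshot = new HashMap<>(memory);   // persist the complete state
            log.clear();                        // logged updates are now covered
        }
    }

    // Recovery: load the last snapshot, then replay the log on top of it.
    Map<String, String> recover() {
        Map<String, String> state = new HashMap<>(snapshot);
        for (String[] entry : log) state.put(entry[0], entry[1]);
        return state;
    }
}
```

In this model, recovery cost is bounded by the snapshot interval, because at most `SNAPSHOT_EVERY - 1` log entries need replaying.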

Atomic communication protocol

  • All updates pass through leader server

  • Leader collects majority quorum for each update

  • Leader election triggered when the leader fails or loses its quorum
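
The majority requirement determines how many server failures a cluster survives. The arithmetic can be sketched as follows; the class and method names are hypothetical, and the sketch assumes that the acknowledgement of the leader itself counts towards the majority.

```java
// Majority quorum arithmetic for a replicated cluster.
final class Quorum {
    // Smallest number of servers that forms a majority of clusterSize.
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Number of servers that can fail while a majority remains.
    static int toleratedFailures(int clusterSize) {
        return clusterSize - majority(clusterSize);
    }

    // An update commits once a majority has acknowledged it
    // (assumption: the count includes the leader).
    static boolean committed(int acks, int clusterSize) {
        return acks >= majority(clusterSize);
    }
}
```

A cluster of four servers tolerates one failure, the same as a cluster of three, which is why odd cluster sizes are the usual choice.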

Clients. 

  • Provided with a list of servers to use

  • Connected to a single server at a time

  • Connection failure handled by switching to another server
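
The client side behavior can be sketched with a toy model. The class `FailoverClient` and the `reachable` predicate, which stands in for a real connection attempt, are hypothetical; the order in which the real client library tries the servers is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model of a client that holds a server list and uses one server at a time.
final class FailoverClient {
    private final List<String> servers;
    private int current = -1;   // index of the connected server, -1 if none

    // serverList uses the comma separated "host:port" form.
    FailoverClient(String serverList) {
        servers = new ArrayList<>(Arrays.asList(serverList.split(",")));
    }

    // Try the servers in round robin order, starting after the current one,
    // so the server that just failed is tried last.
    String connect(Predicate<String> reachable) {
        for (int i = 1; i <= servers.size(); i++) {
            int candidate = Math.floorMod(current + i, servers.size());
            if (reachable.test(servers.get(candidate))) {
                current = candidate;
                return servers.get(candidate);
            }
        }
        current = -1;
        throw new IllegalStateException("no server reachable");
    }
}
```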

2.23.2. Interface

ZooKeeper Data Model

Data. 

  • Tree of named nodes navigated by string paths

  • Support for unique node naming through sequential nodes

  • Node data is array of bytes

  • Updates increment version
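
The data model can be approximated by a small in-memory tree. The sketch below is a toy with hypothetical names (`NodeTree`, `create`, `setData` on a local map), not the ZooKeeper implementation; it assumes that unique names are formed by appending a zero padded ten digit counter kept by the parent, and that version -1 matches any version.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a tree of named nodes with versioned byte array data.
final class NodeTree {
    static final class Node {
        byte[] data;
        int version = 0;
        int nextSequence = 0;   // counter for unique child names
        Node(byte[] data) { this.data = data; }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    NodeTree() { nodes.put("/", new Node(new byte[0])); }

    private static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash == 0 ? "/" : path.substring(0, slash);
    }

    // Create a node under an existing parent. With sequential set, a counter
    // is appended to make the name unique. Returns the actual path.
    String create(String path, byte[] data, boolean sequential) {
        Node parent = nodes.get(parentOf(path));
        if (parent == null) throw new IllegalStateException("no parent for " + path);
        if (sequential) path += String.format("%010d", parent.nextSequence++);
        if (nodes.containsKey(path)) throw new IllegalStateException("exists: " + path);
        nodes.put(path, new Node(data));
        return path;
    }

    // Conditional update: applied only if expectedVersion matches the node
    // version or is -1. Returns the new version.
    int setData(String path, byte[] data, int expectedVersion) {
        Node node = nodes.get(path);
        if (node == null) throw new IllegalStateException("no node: " + path);
        if (expectedVersion != -1 && expectedVersion != node.version)
            throw new IllegalStateException("bad version for " + path);
        node.data = data;
        return ++node.version;
    }
}
```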

Some data objects in the interface are generated from a platform-independent specification.

ZooKeeper Data Objects

module org.apache.zookeeper.data {
    ...
    class Stat {
        long czxid;             // ZXID of transaction that created this node
        long mzxid;             // ZXID of transaction that last modified this node
        long pzxid;             // ZXID of transaction that last modified node children
        long ctime;             // Node creation time
        long mtime;             // Node last modification time
        int version;            // Node version
        int aversion;           // Node ACL version
        int cversion;           // Node child version
        int dataLength;         // Node data length
        int numChildren;        // Node child count
        long ephemeralOwner;    // Owner identifier for ephemeral nodes
    }
    ...
}

ZooKeeper Blocking Interface

public class ZooKeeper {
    public ZooKeeper (String connectString, int sessionTimeout, Watcher watcher) { ... }
    public ZooKeeper (String connectString, int sessionTimeout, Watcher watcher, boolean canBeReadOnly) { ... }
    ...

    public String create (String path, byte data [], List<ACL> acl, CreateMode createMode) { ... }
    public void delete (String path, int version) { ... }

    public Stat exists (String path, boolean watch) { ... }
    public Stat exists (String path, Watcher watcher) { ... }

    public byte [] getData (String path, boolean watch, Stat stat) { ... }
    public byte [] getData (String path, Watcher watcher, Stat stat) { ... }

    public Stat setData (String path, byte data [], int version) { ... }

    public List<String> getChildren (String path, boolean watch) { ... }
    public List<String> getChildren (String path, boolean watch, Stat stat) { ... }
    public List<String> getChildren (String path, Watcher watcher) { ... }
    public List<String> getChildren (String path, Watcher watcher, Stat stat) { ... }

    // Make sure the server is current with the leader.
    public void sync (String path, VoidCallback cb, Object ctx) { ... }

    public synchronized void close () { ... }
}
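
The version argument of the update calls makes optimistic read-modify-write loops possible: read the data together with its version, compute the new value, and write it back only if the version is unchanged. The following sketch shows the pattern on an in-memory stand-in; `VersionedCounter` and its methods are hypothetical, and a real program would use `getData` and `setData` on a ZooKeeper handle instead.

```java
import java.util.function.IntUnaryOperator;

// In-memory stand-in for one node that holds an integer value.
final class VersionedCounter {
    private int value = 0;
    private int version = 0;

    // Value and version read together, similar to data plus Stat.
    static final class Read {
        final int value, version;
        Read(int value, int version) { this.value = value; this.version = version; }
    }

    synchronized Read read() { return new Read(value, version); }

    // Conditional write: succeeds only if nobody wrote since the read.
    synchronized boolean write(int newValue, int expectedVersion) {
        if (expectedVersion != version) return false;
        value = newValue;
        version++;
        return true;
    }

    // Read, compute, write with the version seen; retry on conflict.
    int update(IntUnaryOperator change) {
        while (true) {
            Read seen = read();
            int next = change.applyAsInt(seen.value);
            if (write(next, seen.version)) return next;
        }
    }
}
```

The loop never holds a lock across the computation, so a slow client cannot block others; the cost is repeated work under contention.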

ZooKeeper Non Blocking Interface

public class ZooKeeper {
    ...

    public void create (
        String path, byte data [],
        List<ACL> acl, CreateMode createMode,
        StringCallback cb, Object ctx) { ... }

    public void delete(String path, int version, VoidCallback cb, Object ctx) { ... }

    public void exists (String path, boolean watch, StatCallback cb, Object ctx) { ... }
    public void exists (String path, Watcher watcher, StatCallback cb, Object ctx) { ... }

    public void getData (String path, boolean watch, DataCallback cb, Object ctx) { ... }
    public void getData (String path, Watcher watcher, DataCallback cb, Object ctx) { ... }

    public void setData (String path, byte data [], int version, StatCallback cb, Object ctx) { ... }

    public void getChildren (String path, boolean watch, ChildrenCallback cb, Object ctx) { ... }
    public void getChildren (String path, boolean watch, Children2Callback cb, Object ctx) { ... }
    public void getChildren (String path, Watcher watcher, ChildrenCallback cb, Object ctx) { ... }
    public void getChildren (String path, Watcher watcher, Children2Callback cb, Object ctx) { ... }

    ...
}

public interface StatCallback extends AsyncCallback {
    public void processResult (int rc, String path, Object ctx, Stat stat);
}

public interface DataCallback extends AsyncCallback {
    public void processResult (int rc, String path, Object ctx, byte data [], Stat stat);
}

public interface ChildrenCallback extends AsyncCallback {
    public void processResult (int rc, String path, Object ctx, List<String> children);
}

public interface Children2Callback extends AsyncCallback {
    public void processResult (int rc, String path, Object ctx, List<String> children, Stat stat);
}

...
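
Callback interfaces of this shape are often wrapped into futures by the application. The sketch below shows one way to do so; `SimpleDataCallback` and `AsyncRead` are local stand-ins declared so that the example compiles without the ZooKeeper library, and the treatment of result code 0 as success is the only protocol detail assumed.

```java
import java.util.concurrent.CompletableFuture;

final class CallbackBridge {
    // Stand-in with the shape of DataCallback minus the Stat argument.
    interface SimpleDataCallback {
        void processResult(int rc, String path, Object ctx, byte[] data);
    }

    // Stand-in for a client that offers a non blocking read.
    interface AsyncRead {
        void getData(String path, SimpleDataCallback cb, Object ctx);
    }

    // Turn one callback style call into a future. Result code 0 completes
    // the future normally, any other code completes it exceptionally.
    static CompletableFuture<byte[]> read(AsyncRead client, String path) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        client.getData(path, (rc, p, ctx, data) -> {
            if (rc == 0) future.complete(data);
            else future.completeExceptionally(
                new IllegalStateException("error " + rc + " on " + p));
        }, null);
        return future;
    }
}
```

The `ctx` argument is left unused here because the lambda captures the future directly; with callback objects shared across calls, `ctx` is the place to carry per-call state.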

ZooKeeper Multiple Operations Interface

public class ZooKeeper {
    ...

    // Execute multiple operations atomically.
    public List<OpResult> multi (Iterable<Op> ops) { ... }
    public void multi (Iterable<Op> ops, MultiCallback cb, Object ctx) { ... }

    ...
}

public abstract class Op {
    private int type;
    private String path;

    private Op (int type, String path) {
        this.type = type;
        this.path = path;
    }

    public static Op create (String path, byte [] data, List<ACL> acl, int flags) {
        return new Create (path, data, acl, flags);
    }

    public static class Create extends Op {
        private byte [] data;
        private List<ACL> acl;
        private int flags;

        private Create (String path, byte [] data, List<ACL> acl, int flags) {
            super (ZooDefs.OpCode.create, path);
            this.data = data;
            this.acl = acl;
            this.flags = flags;
        }

        ...
    }

    ...
}


public abstract class OpResult {
    private int type;

    private OpResult (int type) {
        this.type = type;
    }

    public static class CreateResult extends OpResult {
        private String path;

        public CreateResult (String path) {
            super (ZooDefs.OpCode.create);
            this.path = path;
        }

        ...
    }

    ...
}
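
The all-or-nothing semantics of a multiple operations call can be modeled on a plain map. This is a toy sketch with hypothetical names (`AtomicBatch` and its local `Op` interface); it reports failure with a return value for brevity and says nothing about how the server implements the call.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AtomicBatch {
    // One operation of a batch; returns false if it cannot be applied.
    interface Op {
        boolean apply(Map<String, String> state);
    }

    // Fails if the path already exists.
    static Op create(String path, String data) {
        return state -> state.putIfAbsent(path, data) == null;
    }

    // Fails if the path does not exist.
    static Op delete(String path) {
        return state -> state.remove(path) != null;
    }

    // Apply all operations to a copy and install the copy only if every
    // operation succeeded, so a failed batch leaves the state untouched.
    static boolean multi(Map<String, String> state, List<Op> ops) {
        Map<String, String> copy = new HashMap<>(state);
        for (Op op : ops) {
            if (!op.apply(copy)) return false;
        }
        state.clear();
        state.putAll(copy);
        return true;
    }
}
```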

ZooKeeper Watcher Interface

public class ZooKeeper {
    ...

    // Manage watches with explicit mode.
    void addWatch (String basePath, AddWatchMode mode);
    void removeWatches (String path, Watcher watcher, Watcher.WatcherType watcherType, boolean local);

    ...
}


public enum AddWatchMode {
    PERSISTENT (0),
    PERSISTENT_RECURSIVE (1);
}


public interface Watcher {

    abstract public void process (WatchedEvent event);

    public interface Event {
        public enum EventType {
            None (-1),
            NodeCreated (1),
            NodeDeleted (2),
            NodeDataChanged (3),
            NodeChildrenChanged (4);

            ...
        }
    }
}


public class WatchedEvent {
    ...

    public KeeperState getState () { ... }
    public EventType getType () { ... }
    public String getPath () { ... }
}

  • One-shot watches are removed once they fire

  • Persistent watches stay until removed explicitly

  • Recursive watches also report events on children

Watchers are notified of connection failures, but events that could not be delivered because of a failure are lost.
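
The difference between the three kinds of watches can be sketched with a toy registry. The names `WatchRegistry`, `Mode`, `add` and `fire` are hypothetical, callbacks receive only the event path, and the distinction between data, existence and child watches is not modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class WatchRegistry {
    enum Mode { ONE_SHOT, PERSISTENT, PERSISTENT_RECURSIVE }

    private static final class Watch {
        final String path; final Mode mode; final Consumer<String> callback;
        Watch(String path, Mode mode, Consumer<String> callback) {
            this.path = path; this.mode = mode; this.callback = callback;
        }
        boolean matches(String eventPath) {
            if (eventPath.equals(path)) return true;
            // Recursive watches also cover everything below the watched path.
            String prefix = path.endsWith("/") ? path : path + "/";
            return mode == Mode.PERSISTENT_RECURSIVE && eventPath.startsWith(prefix);
        }
    }

    private final List<Watch> watches = new ArrayList<>();

    void add(String path, Mode mode, Consumer<String> callback) {
        watches.add(new Watch(path, mode, callback));
    }

    // Deliver an event for eventPath; one shot watches are dropped as they fire.
    void fire(String eventPath) {
        List<Watch> triggered = new ArrayList<>();
        for (Iterator<Watch> it = watches.iterator(); it.hasNext(); ) {
            Watch w = it.next();
            if (!w.matches(eventPath)) continue;
            triggered.add(w);
            if (w.mode == Mode.ONE_SHOT) it.remove();
        }
        // Callbacks run after the scan so they may register new watches.
        for (Watch w : triggered) w.callback.accept(eventPath);
    }
}
```

With one-shot watches, a client that wants continuous notification has to register again after each event, and changes that happen before the new registration are not reported individually.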

2.23.3. Recipes

The atomicity and consistency guarantees provided by Apache ZooKeeper can be used to implement a number of high-level recipes. Such implementations are provided by the Apache Curator project.

Curator Recipes

Agreement. 

GroupMember

group membership tracking

LeaderLatch

leader election with polling interface

LeaderSelector

leader election with callback interface

Synchronization. 

DistributedBarrier

barrier with explicit state setting calls

DistributedDoubleBarrier

barrier with node count condition

InterProcessMutex

recursive lock

InterProcessSemaphoreMutex

non recursive lock

InterProcessReadWriteLock

recursive read write lock

InterProcessSemaphore

semaphore

InterProcessMultiLock

wrapper for acquiring multiple locks atomically

Communication. 

SimpleDistributedQueue

backwards compatible queue

DistributedQueue

ordered queue with optional item identities

DistributedDelayQueue

queue with delayed delivery

DistributedPriorityQueue

queue with priorities

SharedCount

shared integer counter

DistributedAtomicLong

shared long integer counter

Resiliency. 

CuratorCache

generic local path cache

PersistentNode

connection loss resistant node interface

PersistentTTLNode

connection loss resistant node interface with keepalive

PersistentWatcher

connection loss resistant watch interface
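
To illustrate how such recipes can build on unique node naming, the sketch below shows the decision logic of a common leader election scheme: every member creates a sequentially named node, the member with the lowest sequence number leads, and every other member watches its predecessor. The class `Election` is hypothetical, the ten digit suffix format is an assumption, and the text above does not say whether the Curator recipes work exactly this way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class Election {
    // The sequence number is the trailing ten digit suffix of a member name.
    static long sequenceOf(String name) {
        return Long.parseLong(name.substring(name.length() - 10));
    }

    static List<String> ordered(List<String> members) {
        List<String> sorted = new ArrayList<>(members);
        sorted.sort(Comparator.comparingLong(Election::sequenceOf));
        return sorted;
    }

    // The member with the lowest sequence number leads.
    static String leader(List<String> members) {
        return ordered(members).get(0);
    }

    // Each other member watches only its predecessor, so one departure
    // wakes one member rather than all. Returns null for the leader.
    static String predecessor(List<String> members, String self) {
        List<String> sorted = ordered(members);
        int index = sorted.indexOf(self);
        if (index < 0) throw new IllegalArgumentException("not a member: " + self);
        return index == 0 ? null : sorted.get(index - 1);
    }
}
```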

2.23.4. References

  1. The Apache ZooKeeper Project Home Page. https://zookeeper.apache.org

  2. The Apache Curator Project Home Page. https://curator.apache.org