Monthly Archives: January 2020

Kafka and Zookeeper: main concepts

What is Kafka

Apache Kafka is a distributed real-time streaming platform whose primarily use cases are those requiring high throughput, reliability, and replication characteristics not achievable with ideal performance by applications like JMS, RabbitMQ, and AMQP

Generally speaking, a Big Data streaming platform offers 3 main capabilities:

  • Publishing and subscribing to streams of records, similar to a message queue or enterprise messaging system;
  • Storing streams of records in a fault-tolerant durable way;
  • Processing streams of records as they occur.

Kafka’s Applications and Case Studies

Some of the companies that are using Apache Kafka in their respective use cases are as follows:

  • LinkedIn: Apache Kafka is used at LinkedIn activity data streaming and operational metrics. This data powers various products such as LinkedIn News Feed and LinkedIn Today.
  • Twitter uses Kafka as a part of its Storm (now Herion actually)—a stream-processing infrastructure. Here is an account of Twitter’s Kafka adoption.
  • Foursquare : Kafka powers online-to-online and online-to-offline messaging at Foursquare. It is used to integrate Foursquare monitoring and production systems with Foursquare-and Hadoop-based offline infrastructures.

Kafka: main concepts

A Kafka cluster primarily has 5 main components:

  • Topic: A topic is a category or feed name to which messages are published by the message producers. In Kafka, topics are partitioned and each partition is represented by the ordered immutable sequence of messages. A Kafka cluster maintains the partitioned log for each topic. Each message in the partition is assigned a unique sequential ID called the offset.
  • Broker: A Kafka cluster consists of one or more servers where each one may have one or more server processes running and is called the broker. Topics are created within the context of broker processes.
  • Zookeeper: It serves as the coordination interface between the Kafka broker and consumers. From the Hadoop Wiki ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system.
  • Producers: They publish data to the topics by choosing the appropriate partition within the topic. For load balancing, the allocation of messages to the topic partition can be done in a round-robin fashion or using a custom defined function.
  • Consumers: They are the applications or processes that subscribe to topics and process the feed of published messages.

What is Zookeeper

ZooKeeper is a centralised service for maintaining configuration information, naming, providing distributed synchronisation and group services. In a nutshell, Zookeeper is a coordination interface that allows communication between Kafka and the consumer. The main difference between Zookeeper and the normal filesystems lies in the concept of znode. Every znode is identified by a name and separated by a sequence of path (/).

  • at the highest level, there is a root znode separated by “/” under which, there are 2 logical namespaces, namely config and workers.
  • The config namespace is used for centralized configuration management and the workers namespace is used for naming.
  • Under config namespace, each znode can store upto 1MB of data. The main purpose of such structure (also called ZooKeeper Data Model) is to store synchronized data and describe the metadata of the znode.

Where to go from here

Lots of resources can be found on line, just a few to begin your journey with distributed messaging services:

Apache Kafka Home

Apache Kafka Github Repo

Apache Kafka for Beginners

Big Data Messaging with Kafka

Apache Zookeeper HomePage

Apache Zookeeper GitHub Repo

Spring Cloud Zookeeper

How to configure Zookeeper