Author Archives: icorda

Kafka and Zookeeper: main concepts

What is Kafka

Apache Kafka is a distributed real-time streaming platform whose primary use cases are those requiring high throughput, reliability, and replication, characteristics that messaging technologies such as JMS, RabbitMQ, and AMQP do not deliver with ideal performance.

Generally speaking, a Big Data streaming platform offers 3 main capabilities:

  • Publishing and subscribing to streams of records, similar to a message queue or enterprise messaging system;
  • Storing streams of records in a fault-tolerant durable way;
  • Processing streams of records as they occur.

Kafka’s Applications and Case Studies

Some of the companies that are using Apache Kafka in their respective use cases are as follows:

  • LinkedIn: Apache Kafka is used at LinkedIn for streaming activity data and operational metrics. This data powers various products such as LinkedIn News Feed and LinkedIn Today.
  • Twitter: Kafka is used as part of Twitter's stream-processing infrastructure, originally Storm and now Heron. Here is an account of Twitter’s Kafka adoption.
  • Foursquare: Kafka powers online-to-online and online-to-offline messaging at Foursquare. It is used to integrate Foursquare monitoring and production systems with Foursquare's Hadoop-based offline infrastructure.

Kafka: main concepts

A Kafka cluster has 5 main components:

  • Topic: A topic is a category or feed name to which messages are published by the message producers. In Kafka, topics are partitioned, and each partition is an ordered, immutable sequence of messages. A Kafka cluster maintains a partitioned log for each topic. Each message in a partition is assigned a unique sequential ID called the offset.
  • Broker: A Kafka cluster consists of one or more servers, each of which may run one or more server processes; each such process is called a broker. Topics are created within the context of broker processes.
  • Zookeeper: It serves as the coordination interface between the Kafka brokers and consumers. According to the Hadoop Wiki, ZooKeeper "allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system".
  • Producers: They publish data to topics by choosing the appropriate partition within the topic. For load balancing, messages can be allocated to topic partitions in a round-robin fashion or by a custom-defined function.
  • Consumers: They are the applications or processes that subscribe to topics and process the feed of published messages.
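
To make the ideas of partitions, offsets and partition selection more concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions, not the Kafka client API: the `Topic` class, its method names and the topic name `page-views` are all hypothetical.

```python
from itertools import count
import zlib


class Topic:
    """A toy, in-memory stand-in for a partitioned topic (NOT the Kafka API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index plays the role of the offset.
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = count()  # drives the round-robin allocation

    def round_robin(self, _key):
        # Spread messages evenly across partitions, ignoring the key.
        return next(self._next) % len(self.partitions)

    def by_key(self, key):
        # A custom function: a stable hash keeps equal keys in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, message, key=None, partitioner=None):
        """Append a message and return its (partition, offset)."""
        choose = partitioner or self.round_robin
        partition = choose(key)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def consume(self, partition, offset):
        """Read every message of a partition from the given offset onwards."""
        return self.partitions[partition][offset:]


topic = Topic("page-views", num_partitions=3)  # hypothetical topic name
placements = [topic.publish(f"event-{i}") for i in range(6)]
print(placements)  # → [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

Passing `partitioner=topic.by_key` together with a `key` sends all messages with the same key to the same partition, which preserves their relative order within that partition.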

What is Zookeeper

ZooKeeper is a centralised service for maintaining configuration information, naming, providing distributed synchronisation and group services. In a nutshell, Zookeeper is a coordination interface that allows communication between Kafka and the consumer. The main difference between Zookeeper and a normal filesystem lies in the concept of the znode. Every znode is identified by a path, that is, a sequence of names separated by a slash (/).

  • At the highest level there is a root znode, denoted by “/”, under which there are, in this example, 2 logical namespaces, namely config and workers.
  • The config namespace is used for centralized configuration management and the workers namespace is used for naming.
  • Under the config namespace, each znode can store up to 1MB of data. The main purpose of such a structure (also called the ZooKeeper Data Model) is to store synchronized data and describe the metadata of the znode.
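
The hierarchy described above can be sketched as a small in-memory tree in Python. This is a toy model, not a ZooKeeper client: the `ZNodeTree` class, its methods and the `db_url` znode are hypothetical names chosen for illustration, and the only rules it enforces are the ones mentioned above (path-based names, a parent must exist, at most 1MB of data per znode).

```python
MAX_ZNODE_BYTES = 1024 * 1024  # the 1MB per-znode limit mentioned above


class ZNodeTree:
    """Toy in-memory model of a znode hierarchy (NOT a ZooKeeper client)."""

    def __init__(self):
        self.nodes = {"/": b""}  # path -> data; the root znode always exists

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent znode {parent} does not exist")
        if path in self.nodes:
            raise KeyError(f"znode {path} already exists")
        if len(data) > MAX_ZNODE_BYTES:
            raise ValueError("znode data exceeds 1MB")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

    def children(self, path):
        # Direct children only: names with no further "/" after the prefix.
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self.nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = ZNodeTree()
tree.create("/config")
tree.create("/workers")
tree.create("/config/db_url", b"jdbc:postgresql://db/app")  # hypothetical setting
print(tree.children("/"))  # → ['config', 'workers']
```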

Where to go from here

Lots of resources can be found online; here are just a few to begin your journey with distributed messaging services:

Apache Kafka Home

Apache Kafka Github Repo

Apache Kafka for Beginners

Big Data Messaging with Kafka

Apache Zookeeper HomePage

Apache Zookeeper GitHub Repo

Spring Cloud Zookeeper

How to configure Zookeeper



Setting up your Deep Learning Environment (Mac)

So, you have embarked on your Deep Learning journey and perhaps you are navigating through the concepts of Gradient Descent, Back-propagation and so forth. After all the theory, you are eager to get your environment ready to do some actual ‘deep learning hard work’ and you have no idea where to start. You are in the right place then. This short tutorial has been put together for Mac users (sorry Windows aficionados) and will provide you with what you need to get started.

Yes, you need Python!

Surely you know that Python is the key programming language when it comes to Machine and Deep Learning. First, make sure you have our beloved Homebrew:

/usr/bin/ruby -e “$(curl -fsSL"

Install Python 3 (with this version, pip3 will be automatically installed)

brew install python3

Virtual Environment

In order to keep things clean and contain all your deep learning related dependencies in one space, it is useful to use virtual environments.

pip3 install virtualenv virtualenvwrapper

You will also need to modify your bash profile file:

vim ~/.bash_profile

by adding the following:

# virtualenv and virtualenvwrapper
export WORKON_HOME=$HOME/.virtualenvs
export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python3
source /usr/local/bin/

Next step is to create a virtual environment for your deep learning project:

mkvirtualenv cv -p python3

This will create a virtual environment named cv. To leave the environment, type the command deactivate.

Some Additional Dependencies

You will also need to install cmake to be able to use dlib, a C++ toolkit containing Machine Learning algorithms:

brew install cmake

Additionally, you will need to download X11 to display the image output from both dlib and OpenCV.

Let’s install the real stuff

Enter your virtual environment by typing the following:

workon cv

Some additional dependencies should be taken care of:

pip install numpy h5py pillow scikit-image

Finally, we can install OpenCV:

pip install opencv-python

Then, we will be installing Dlib, Tensorflow and Keras:

pip install dlib
pip install tensorflow
pip install keras

Keras, in particular, is a user-friendly, beginner-oriented library for Machine Learning and Deep Learning models that runs on top of Tensorflow. Happy Machine Learning modelling 🙂
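
Before moving on, it can be handy to confirm that everything landed in the cv environment. The snippet below is a small sketch, assuming Python 3.8 or later (for importlib.metadata) and that it is run inside the virtual environment; the function name installed_versions is just an illustrative choice. It only reads the installed package metadata, so it does not prove that each library imports correctly.

```python
from importlib.metadata import PackageNotFoundError, version

# The pip package names used in the install commands above.
PACKAGES = [
    "numpy", "h5py", "pillow", "scikit-image",
    "opencv-python", "dlib", "tensorflow", "keras",
]


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, found in installed_versions().items():
    print(f"{name:15s} {found or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed again with pip from inside the environment.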


Clearing the Confusion: AI vs Machine Learning vs Deep Learning Differences

This is perhaps the most basic question for beginners learning about Machine Learning and Deep Learning.

Read Parquet Files with SparkSQL

SparkSQL is a Spark module for working with structured data, and it can also be used to read columnar data formats such as Parquet files. Here are a number of useful commands that can be run from the spark-shell:

// Set the context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the parquet file in HDFS and print its schema
val df = sqlContext.read.parquet("hdfs://user/myfolder/part-r-00033.gz.parquet")
df.printSchema

// Show the top 10 rows of data from the parquet file (false = do not truncate values)
df.show(10, false)

// Convert to JSON and print out the content of 1 record
df.toJSON.take(1).foreach(println)


Jenkins Best Practices – Practical Continuous Deployment in the Real World — GoDaddy Open Source HQ

Source: Jenkins Best Practices – Practical Continuous Deployment in the Real World — GoDaddy Open Source HQ

Java Beans and DTOs

DTO (Data Transfer Object)

Data Transfer Object is a pattern whose aim is to transport data between layers and tiers of a program. A DTO should contain NO business logic.

import java.util.List;

// A plain data carrier: fields plus getters and setters, no business logic
public class UserDTO {
    private String firstName;
    private String lastName;
    private List<String> groups;

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    public List<String> getGroups() {
        return groups;
    }

    public void setGroups(List<String> groups) {
        this.groups = groups;
    }
}

Java Beans

Java Beans are classes that follow certain conventions or, even better, a Sun/Oracle standard/specification, as explained here:

Essentially, Java Beans adhere to the following:

  • all properties are private (and they are accessed through getters and setters);
  • they have zero-arg constructors (aka default constructors)
  • they implement the Serializable Interface

The main reason why we use Java Beans is to encapsulate data:

public class BeanClassExample implements java.io.Serializable {

  private int id;

  // no-arg constructor
  public BeanClassExample() {
  }

  public int getId() {
    return id;
  }

  public void setId(int id) {
    this.id = id;
  }
}

So, yeah what is the real difference? If any?

In a nutshell, Java Beans follow strict conventions (as discussed above) and contain no behaviour (as opposed to state), except for storage, retrieval, serialization and deserialization. Java Beans are indeed a specification, while DTO (Data Transfer Object) is a pattern in its own right. It is more than acceptable to use a Java Bean to implement the DTO pattern.

Avro is amazing!

Why Avro For Kafka Data?

Scala, give me a break :)

I was recently asked whether it is possible to use break (and continue as well) in a loop with Scala, and it occurred to me that I had never come across such a case. Coming from Java, I do know how to employ break and continue in a while loop, for example, so why would it be different in Scala, considering that it is built on top of the JVM? It is actually a bit more complicated than that. Although Scala does not have the keywords break and continue, it does offer similar functionality through scala.util.control.Breaks.

Here is an example of how to use break from the Class Breaks, as follows:

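import java.io.{BufferedReader, InputStreamReader}
import scala.util.control.Breaks._

val in = new BufferedReader(new InputStreamReader(System.in))

breakable {
  while (true) {
    println("? ")
    // break() exits the enclosing breakable block
    if (in.readLine() == "") break()
  }
}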

In Java, the above would correspond to this:

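BufferedReader in =
   new BufferedReader(new InputStreamReader(System.in));
while (true) {
  System.out.println("? ");
  // use equals(): == compares references, not contents, for Java strings
  if ("".equals(in.readLine())) break;
}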

The breakable function has been available since Scala 2.8. Before that, we would have tackled the issue mostly through 2 approaches:

  • by adding a boolean variable indicating whether the loop should keep running;
  • by rewriting the loop as a recursive function.

Happy Scala programming 🙂


2 minutes to spare: Apache NiFi on Mac

As a Mac user, I usually run Apache NiFi using one of two approaches:

  • by standing up a Docker container;
  • by downloading and installing it locally on the Mac.

Running a NiFi Container

You can install Docker on Mac via Homebrew:

brew install docker

Alternatively, it is possible to download the Docker Community Edition (CE): an easy-to-install desktop app for building, packaging and testing dockerised apps, which includes tools such as the Docker command line, Docker Compose and Docker Notary.

After installing Docker, you can pull the NiFi image:

docker pull apache/nifi:1.5.0

Next, we can start a container from the image we just pulled and watch it run:

docker run -p 8080:8080 apache/nifi:1.5.0

Downloading and Installing NiFi locally

Installing Apache NiFi on Mac is quite straightforward, as follows:

brew install nifi

This assumes that you have Homebrew installed. If that is not the case, this is the command you will need:

ruby -e "$(curl -fsSL" < /dev/null 2> /dev/null

Here is where NiFi has been installed:


Some basic operations can be done with these commands:

bin/ run, it runs in the foreground,

bin/ start, it runs in the background

bin/ status, it checks the status

bin/ stop, it stops NiFi

Next step, whatever approach you took at the beginning, is to verify that your NiFi installation/dockerised version is running. This is as simple as visiting the following URL:


Happy Nif-ing 🙂

Machine Learning’s ‘Amazing’ Ability to Predict Chaos


Download SQUID – Your News Buddy