What is a Database?
A database is an organized collection of information (data). You can store data in a database and retrieve it when needed. Retrieving data is much faster than with traditional storage methods (information written on paper, etc.). Depending on the database model, data may be stored as rows in tables, as documents, or as key:value pairs.
Modern databases follow the ACID standard, which consists of:
Atomicity: A method for handling errors. It guarantees that a transaction is either committed fully or aborted entirely. If any statement in a transaction fails, the whole transaction is considered failed and the operation is aborted. Atomicity guarantees that updates/modifications are never applied to the database only partially.
Consistency: Guarantees data integrity. It ensures that a transaction can alter the database state only if the transaction is valid and follows all defined rules. For example, a database of bank account information cannot have the same account number for two people.
Isolation: Transactions do not depend on one another. In the real world, transactions are executed concurrently (multiple transactions read/write to the database at the same time). Isolation guarantees that concurrent transactions produce the same result as if they had been executed sequentially (one transaction completing before the next one starts).
Durability: Guarantees that once a transaction is committed, it remains committed even in the case of a system failure. This means committed/completed transactions are saved to permanent, non-volatile storage (e.g., a disk drive).
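Atomicity and consistency can be seen in a few lines of code. The sketch below uses Python's built-in sqlite3 module with a hypothetical `accounts` table: the second statement violates a constraint, so the whole transaction is rolled back and neither balance changes.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (number TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 50)")
conn.commit()

# Try to transfer 30 from A to B, but the second statement violates the
# PRIMARY KEY rule (Consistency), so the whole transaction is rolled
# back (Atomicity): no partial update survives.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE number = 'A'")
        conn.execute("INSERT INTO accounts VALUES ('A', 0)")  # duplicate key -> error
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT number, balance FROM accounts"))
print(balances)  # {'A': 100, 'B': 50} -- the partial update was undone
```

Note that the `UPDATE` on account A succeeded on its own; atomicity is what undoes it once a later statement in the same transaction fails.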
What is Kafka?
According to Jay Kreps, the main developer of Apache Kafka, Kafka is “a system optimized for writing”. Apache Kafka is an Event Streaming Platform: a fast, scalable, fault-tolerant, publish-subscribe messaging system. Kafka is written in Scala and Java.
Why the need for yet another system?
Kafka was born in 2011, when SSDs were not yet common and disks were the first bottleneck for database transactions. LinkedIn wanted a system that was faster, scalable, and highly available. Kafka was born to provide a solution for big data and an alternative to vertical scaling.
When the need arises to manage a large number of operations, message exchanges, monitoring, alarms, and alerts, the system must be fast and reliable. Vertical scaling means upgrading physical hardware (RAM, CPU, etc.); it is not always ideal and may not be cost-effective. Kafka, coupled with a proper architecture, lets the system scale horizontally.
LinkedIn later open-sourced Kafka, which has helped it improve over time. Before Kafka, LinkedIn ingested about 1 billion messages per day; now it handles 7 trillion messages across 100 clusters, 4,000 brokers, and 100,000 topics over 7 million partitions.
Strengths of Kafka
- Decouple producers and consumers by using a push-pull model (the sender and the receiver of data don't need to know each other).
- Provide persistence for message data within the messaging system to allow multiple consumers.
- Optimize for high throughput of messages.
- Allow for horizontal scaling.
What is the Publish/Subscribe model?
A messaging pattern that decouples senders from receivers.
All messages are divided into classes; in Kafka, those classes are called Topics.
A receiver (subscriber) can register for one or more topics and will be notified about new messages on those topics asynchronously. Topics are populated by senders/publishers. For example, suppose there is a notification for new bank account holders: when someone creates a new bank account, they automatically get that notification without any prior action by the bank.
The Publish/Subscribe model is used to broadcast a message to many subscribers; all subscribed receivers get the message asynchronously.
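The pattern above can be sketched in a few lines. This is a minimal, in-memory illustration of publish/subscribe (not how Kafka is implemented internally); the `PubSubBroker` class and the "new-accounts" topic name are made up for the example.

```python
from collections import defaultdict

class PubSubBroker:
    """A minimal, in-memory sketch of the publish/subscribe model."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher does not know who receives the message;
        # every subscriber of the topic gets its own copy.
        for callback in self.subscribers[topic]:
            callback(message)

broker = PubSubBroker()
inbox_a, inbox_b = [], []
broker.subscribe("new-accounts", inbox_a.append)
broker.subscribe("new-accounts", inbox_b.append)
broker.publish("new-accounts", "Welcome, new account holder!")
print(inbox_a, inbox_b)  # both subscribers received the same message
```

Note how the publisher only names a topic; it never references a receiver directly, which is exactly the decoupling the model provides.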
Core concepts in Kafka
Message: The unit of data within Kafka is called a message. You can think of a message as a single row in a database. For example, take an online purchase: the item purchased, the bank account used to pay, and the delivery information together can form a message. Since Kafka doesn't enforce a predefined schema, a message can be anything. A message is composed of an array of bytes, and messages can be grouped into batches, so we don't have to wait for one message to be delivered before sending another. A message can contain metadata such as a ‘key’, which is used in partitioning.
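The message ‘key’ drives partitioning: the key is hashed, and the hash decides which partition the message lands in, so all messages with the same key stay together and keep their order. A rough sketch of the idea (Kafka's default partitioner uses a murmur2 hash; `crc32` here is just a stand-in for illustration, and the key names are made up):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the message key and map it onto one of the partitions.
    # Kafka's real partitioner uses murmur2; crc32 is a stand-in here.
    return zlib.crc32(key) % num_partitions

# Messages with the same key always map to the same partition,
# which preserves their relative order.
p1 = partition_for(b"account-42")
p2 = partition_for(b"account-42")
print(p1 == p2)  # True
```

Messages sent without a key are instead spread across partitions (e.g., round-robin), trading per-key ordering for balance.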
Producer: The application that sends data to Kafka is called the Producer. A Producer sends and a Consumer pulls data; they don't need to know one another, only to agree on where to put the messages and how to address them. We should always make sure the Producer is properly configured, because Kafka is responsible only for delivering messages, not for how they are sent.
Topic: When the Producer writes, it writes to a Topic, and the data it sends is stored in that Topic. The Producer sets the topic name. You can think of a Topic as a file: data written to it is appended at the end, and a reader reads it from top to bottom. Topics whose names start with _ are for internal use. Unlike in a database, existing data cannot be modified.
Broker: A Kafka instance is called a Broker. It is in charge of storing the topics sent by Producers and serving data to Consumers. Each Broker handles its assigned topics, and Zookeeper keeps track of which Broker is in charge of each Topic. A Kafka cluster is a group of such instances (Brokers). The Broker itself is lightweight and very fast, without much of the usual Java overhead (garbage collection pauses, page & memory management, etc.). The Broker also handles replication.
Consumer: Reads data from the topic of its choice. To access the data, a Consumer needs to subscribe to the Topic. Multiple consumers can read from the same Topic.
Kafka and Zookeeper
Kafka works together with a configuration server. The server of choice is Zookeeper, a centralized coordination service. Kafka is in charge of holding messages, while Zookeeper is in charge of the configuration and metadata of those messages; everything about Kafka's configuration ends up in Zookeeper. Together they ensure high availability. Other options are available, but Zookeeper is the industry choice. Both are open source and work well together.
Zookeeper is very resource-efficient and designed to be highly available. It maintains configuration information and naming, and provides distributed synchronization. Zookeeper runs as a cluster, and the number of Zookeeper nodes must be odd (1, 3, 5, ...). Three or five nodes guarantee high availability; a single instance does not, because if it goes down, we lose Zookeeper. Zookeeper is commonly installed on machines separate from Kafka because it needs to be highly available. There are official plans to remove the Zookeeper dependency so that, in the future, Kafka runs on its own.
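The odd-number rule follows from how majority quorums work: the ensemble stays available only while more than half of its nodes are up, so adding a fourth node to a three-node cluster buys no extra fault tolerance. A quick sketch of the arithmetic:

```python
def tolerated_failures(nodes: int) -> int:
    # A Zookeeper ensemble needs a strict majority (quorum) of nodes
    # to stay available, so it survives at most (nodes - 1) // 2 failures.
    return (nodes - 1) // 2

for n in (1, 2, 3, 4, 5):
    print(f"{n} node(s) tolerate {tolerated_failures(n)} failure(s)")
```

A 4-node ensemble tolerates only 1 failure, the same as a 3-node one, while costing an extra machine, which is why ensembles are sized with odd numbers.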
Kafka and Zookeeper Installation
On GNU/Linux, you must have Java installed (version 8 or newer). Download Kafka from https://downloads.apache.org/kafka/.
Run the following commands in your terminal (assuming the downloaded file is kafka_2.13-2.6.0.tgz).
Extract the tarball :
tar -xzf kafka_2.13-2.6.0.tgz
Change the directory into the one with the Kafka binaries :
cd kafka_2.13-2.6.0
Run Zookeeper :
bin/zookeeper-server-start.sh config/zookeeper.properties
Start the Kafka broker service :
bin/kafka-server-start.sh config/server.properties
The downloaded tarball comes with all the binaries, configuration files, and some utilities. Systemd unit files, PATH adjustments (adding the binaries to PATH), and configuration tuned to your requirements are not included.
Kafka stores data as Topics. Topics are partitioned and replicated across multiple brokers in a cluster. Producers send data to topics so that Consumers can read it.
What is a Kafka Partition?
Topics are split into multiple Partitions to parallelize work. New writes are appended at the end of a partition's segment. By using Partitions, we can write/read data through multiple Brokers, which speeds up the process, reduces bottlenecks, and adds scalability to the system.
For example, a Topic named “topic name” divided into four partitions can be written/read using a different Broker for each partition. New data is written at the end of each partition by its respective Broker.
Multiple Consumers can read from the same Partition simultaneously. When a Consumer reads, it reads from an offset. The offset is a sequential number, stored in metadata, that identifies each message's position within the Partition. Consumers can either read from the beginning or start from a certain offset.
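A partition behaves like an append-only log where each message gets the next sequential offset. The sketch below is an illustration of that idea only (the `Partition` class is made up, not a Kafka API):

```python
class Partition:
    """A sketch of a partition as an append-only log where each
    message gets a sequential offset (illustration only)."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # the offset assigned to this message

    def read(self, offset=0):
        # A consumer reads forward from a given offset;
        # offset 0 means "from the beginning".
        return self.log[offset:]

p = Partition()
for msg in ["m0", "m1", "m2", "m3"]:
    p.append(msg)

print(p.read())          # from the beginning: ['m0', 'm1', 'm2', 'm3']
print(p.read(offset=2))  # from offset 2: ['m2', 'm3']
```

Because each consumer just remembers its own offset, many consumers can read the same partition at once without interfering with each other.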
Each Partition has one server called the Leader (the Kafka Broker in charge of serving that specific data), and sometimes more servers that act as Followers. The Leader handles all read/write requests for the Partition while the Followers passively replicate the Leader. If a Leader fails for some reason, one of the Followers automatically becomes the new Leader; therefore, Leaders can change over time. Every Broker can serve as the Leader for some data: each Broker is the Leader for some Partitions and a Follower for others, which provides load balancing within the cluster. Zookeeper keeps track of which Broker is the Leader for each Partition.