July 7, 2018

What is Apache Kafka? What is it used for?

I have been hearing a lot about Apache Kafka for a while. Even after reading a lot of material, I still was not sure what it is good for. Here I write down the gist of some of my findings, in the hope that it helps someone along the way:


Apache Kafka
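  • Fast, scalable, and fault-tolerant. For moving streams of messages around, it is usually described as much faster and lighter on resources than pushing the same data through DBs, Cassandra, HDFS, ...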
  • Publish-subscribe messaging system between producers and consumers based on topics. 
  • Used in place of traditional message brokers (for example, ones based on JMS or AMQP) because of its higher throughput, reliability, and replication.
  • A platform for high-end new generation distributed applications.
  • It can work in combination with Apache Storm, Apache HBase, and Apache Spark for real-time analysis and rendering of streaming data. It can also be integrated with Apache Flume. For example, Kafka can carry geospatial data from a fleet of long-haul trucks, or sensor data from heating and cooling equipment in office buildings.
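
To make the publish-subscribe idea in the list above more concrete, here is a minimal sketch in plain Python. It is not the Kafka API: `MiniBroker`, `publish`, and `poll` are made-up names for illustration, and the topic name `truck-positions` is just an assumed example. It only shows that producers append to a per-topic log and that each consumer group keeps its own read position.

```python
from collections import defaultdict


class MiniBroker:
    """Toy in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> list of messages
        self.offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        self.logs[topic].append(message)
        return len(self.logs[topic]) - 1

    def poll(self, group, topic):
        """Return the messages this consumer group has not read yet."""
        start = self.offsets[(group, topic)]
        batch = self.logs[topic][start:]
        self.offsets[(group, topic)] = start + len(batch)
        return batch


broker = MiniBroker()
broker.publish("truck-positions", {"truck": 17, "lat": 41.9, "lon": -87.6})
broker.publish("truck-positions", {"truck": 42, "lat": 34.1, "lon": -118.2})

# Two independent consumer groups read the same stream at their own pace.
dashboard = broker.poll("dashboard", "truck-positions")
archiver = broker.poll("archiver", "truck-positions")
```

The producer never needs to know who is reading, and reading does not remove anything from the log; that is the decoupling the rest of this post is about.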
Technologies for batch processing already existed, but their deployment details were exposed to downstream users. Moreover, those technologies were not suitable for real-time processing.

Whatever the industry or use case, Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop.

Common use cases include:
  1. Stream Processing 
  2. Website Activity Tracking 
  3. Metrics Collection and Monitoring 
  4. Log Aggregation

Some history: LinkedIn needed low-latency ingestion of huge amounts of data from its website into a lambda architecture that could process events in real time. Since none of the available solutions could deal with this, Kafka was developed in 2010 as a solution to this problem.

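Basically, you go from hand-connected nodes, where every producer is wired directly to every consumer, to a decoupled setup, where producers and consumers only talk to Kafka.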


What you sacrifice is:
  • Ordering: each topic (kind of message) is split across multiple partitions, and ordering is only guaranteed within a partition. If you want strict ordering for the whole topic, you can use ONE partition for it and you get strict ordering, just like a DB! Just keep in mind to update the configs. Also, look for where your business logic is parallelizable: actions on the cart of one customer need ordering, but carts of different customers can be put into different partitions. So you can use multiple partitions and still keep parallelism.
  • Persistence: Kafka stores messages on disk but discards them after a retention period, 7 days by default. So if you want to keep data longer, you can write it to a file, a DB, etc. The New York Times famously uses infinite retention and keeps every article it has published since 1851 in Kafka.

A database, on the other hand, has a single point of entry even though it might have many replicas, and there is a lot of housekeeping to do: updating indexes, log trees, B-tree structures, replication, etc. It can become a single point of failure or a bottleneck.
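
The cart example from the ordering point can be sketched like this. It is a simulation in plain Python, not Kafka client code: the partition count, the customer names, and the helper names `partition_for` and `actions_of` are assumptions for illustration, and CRC32 is only a stand-in for whatever hash the real client applies to the message key. The point is that messages with the same key always land in the same partition, so their relative order is preserved.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for the illustration


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition (CRC32 is a stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is an append-only list, so order inside it is preserved.
partitions = defaultdict(list)

events = [
    ("alice", "add shoes"),
    ("bob", "add book"),
    ("alice", "remove shoes"),
    ("bob", "checkout"),
    ("alice", "checkout"),
]
for customer, action in events:
    partitions[partition_for(customer)].append((customer, action))


def actions_of(customer):
    """Read one customer's actions back from the partition that holds them."""
    partition = partitions[partition_for(customer)]
    return [action for c, action in partition if c == customer]
```

Each customer's actions come back in the order they were produced, while different customers may be spread over different partitions and processed in parallel.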

https://www.youtube.com/watch?v=hyJZP-rgooc
https://www.youtube.com/watch?v=1vLMuWsfMcA

Note: Each partition has several replicas; one of them is the leader and the rest are followers. The leader is the entry point.
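
A rough sketch of the leader/follower idea, again in plain Python and heavily simplified. The class `Partition`, its methods, and the broker names are made up for illustration. Real followers copy from the leader asynchronously and the real leader election is more involved; here followers are updated immediately and the next replica in line is promoted.

```python
class Partition:
    """Toy model of one replicated partition: a leader plus followers."""

    def __init__(self, replica_ids):
        self.logs = {rid: [] for rid in replica_ids}  # one log copy per replica
        self.leader = replica_ids[0]

    def append(self, record):
        """Writes enter through the leader, then followers copy the record."""
        self.logs[self.leader].append(record)
        for rid, log in self.logs.items():
            if rid != self.leader:
                log.append(record)  # simplification: copied immediately

    def fail_leader(self):
        """Drop the current leader and promote the first remaining replica."""
        del self.logs[self.leader]
        self.leader = next(iter(self.logs))


p = Partition(["broker-1", "broker-2", "broker-3"])
p.append("order-created")
p.fail_leader()          # broker-1 goes away
p.append("order-paid")   # writes continue through the new leader
```

Because the followers already hold a copy of the log, losing the leader does not lose the data, and a follower can take over as the new entry point.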
