October 14, 2013

Twitter Analytics Infrastructure: Parquet, Storm - Analytics @ WebScale

I was watching a talk by Dmitriy Ryaboy, the engineering manager of Twitter's analytics infrastructure team, at the Analytics @ WebScale conference. The talk covered Twitter's analytics infrastructure, including Parquet and Storm, and how the team makes sure the different teams at Twitter have the right, well-optimized tools and infrastructure to use.

===
So let's get our hands dirty:
Logs are available in HDFS and stored in Apache Thrift object format; everything is cleaned up and linked, with nested logs. They treat logs as table rows, and a log has seven attributes, which they refer to as columns. Storage is columnar (rather than row-first), with type-specific encodings, compression, and so on, and the data is further divided into chunks that are compressed with gzip. Everything is laid out in their own file format, along with some metadata about the contents of the file, possibly some indexes, etc. The format is open-sourced.

Parquet: Columnar Storage for Hadoop
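
To make the row-versus-column trade-off concrete, here is a minimal, self-contained Java sketch (my illustration, not Twitter's or Parquet's code; the log fields are invented). It lays the same records out row-wise and column-wise and gzips each layout; on real data the column layout tends to compress better, since similar values end up next to each other.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class ColumnVsRow {
    // Invented log records: (userAgent, statusCode) pairs standing in
    // for the Thrift log objects described above.
    static final String[][] ROWS = {
        {"Mozilla/5.0", "200"}, {"Mozilla/5.0", "200"},
        {"curl/7.30.0", "404"}, {"Mozilla/5.0", "200"},
    };

    static int gzippedSize(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(s.getBytes("UTF-8"));
        gz.close();
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        // Row storage: the columns of each record are interleaved.
        StringBuilder rowMajor = new StringBuilder();
        for (String[] row : ROWS)
            rowMajor.append(row[0]).append(',').append(row[1]).append('\n');

        // Column storage: all values of one column form a contiguous chunk,
        // so runs of similar values sit together and gzip well.
        StringBuilder colMajor = new StringBuilder();
        for (int col = 0; col < 2; col++)
            for (String[] row : ROWS)
                colMajor.append(row[col]).append('\n');

        System.out.println("row-major gzipped:    " + gzippedSize(rowMajor.toString()) + " bytes");
        System.out.println("column-major gzipped: " + gzippedSize(colMajor.toString()) + " bytes");
    }
}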

On the data side, there are two paths to get data: Hadoop and Storm. If the data is already available via Hadoop, we use it; otherwise we use a sample of the data provided by Storm and fix the missing pieces later, once Hadoop has finished.
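
A minimal sketch of that fallback logic (my illustration with invented names, batchCounts, realtimeCounts, and a batch watermark; not Twitter's serving code): exact batch results are preferred wherever Hadoop has caught up, and the streaming approximation fills in the rest.

import java.util.HashMap;
import java.util.Map;

// Hypothetical query layer: exact counts from Hadoop where the batch
// job has finished, Storm's streaming approximation for newer hours.
public class HybridCounts {
    private final Map<String, Long> batchCounts = new HashMap<String, Long>();    // from Hadoop
    private final Map<String, Long> realtimeCounts = new HashMap<String, Long>(); // from Storm
    private long batchWatermarkHour = -1; // last hour fully covered by Hadoop

    // The Hadoop job's output replaces the approximation for the hours it covers.
    public void loadBatchResult(String key, long hour, long count) {
        batchCounts.put(key + "@" + hour, count);
        if (hour > batchWatermarkHour) batchWatermarkHour = hour;
    }

    // The Storm path increments counts as events stream in.
    public void incrementRealtime(String key, long hour) {
        String k = key + "@" + hour;
        Long c = realtimeCounts.get(k);
        realtimeCounts.put(k, c == null ? 1L : c + 1L);
    }

    // Prefer the exact batch value; fall back to the streaming estimate.
    public long get(String key, long hour) {
        String k = key + "@" + hour;
        if (hour <= batchWatermarkHour && batchCounts.containsKey(k))
            return batchCounts.get(k);
        Long approx = realtimeCounts.get(k);
        return approx == null ? 0L : approx;
    }
}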

Twitter Storm: you have data queues and you have workers; workers process tuples from arbitrary queues and output to other queues for other worker daemons to work on. The building blocks (see the sketch below):
bolt: transforms data tuples (e.g. JSON): merge, aggregate, filter, etc.
spout: a data source of streams, e.g. a stream of HTTP requests or a stream of tweets
topology: a directed acyclic graph of spouts and bolts.
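
As a concrete illustration, here is a minimal sketch against Storm's 2013-era Java API (the backtype.storm packages); the tweet-splitting word count is my invention, not Twitter's production topology. A spout emits tuples, a bolt transforms them, and the TopologyBuilder wires them into a DAG.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {
    // Spout: a source of a stream, here emitting made-up tweet texts.
    public static class TweetSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] tweets = {"storm at twitter", "parquet at twitter"};
        private int i = 0;

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100); // throttle the demo stream
            collector.emit(new Values(tweets[i++ % tweets.length]));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("tweet"));
        }
    }

    // Bolt: transforms tuples, here splitting tweets and counting words.
    public static class WordCountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new java.util.HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("tweet").split(" ")) {
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
                collector.emit(new Values(word, counts.get(word)));
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) {
        // The topology: a DAG wiring the spout into the bolt.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout());
        builder.setBolt("counts", new WordCountBolt()).shuffleGrouping("tweets");

        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}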

The master node is a Thrift service that you send your topology to; it comes up with the worker assignments, which are recorded in ZooKeeper. If the master goes away, the system does not need it and keeps working as assigned.
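
To connect that to code: in a real deployment the topology from the sketch above would go to the master through StormSubmitter rather than LocalCluster. Again a sketch; the topology name and worker count are invented.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class DeployToNimbus {
    public static void main(String[] args) throws Exception {
        // Same hypothetical wiring as the local sketch above.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new WordCountTopology.TweetSpout());
        builder.setBolt("counts", new WordCountTopology.WordCountBolt())
               .shuffleGrouping("tweets");

        Config conf = new Config();
        conf.setNumWorkers(4); // worker slots for the master to schedule

        // Ships the serialized topology to the master over Thrift; the master
        // writes the worker assignments into ZooKeeper, and the workers then
        // run on their own, so they keep going even if the master dies.
        StormSubmitter.submitTopology("tweet-counts", conf, builder.createTopology());
    }
}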
