August 24, 2013

Push the Limits of Big Data

You know you are working on big data when you hit the capacity of the giant text indexer Lucene. Lucene has a hard maximum on the number of documents per index: document IDs are signed 32-bit integers, so a single index tops out at 2,147,483,647 documents. When you are processing over two million compressed files, each of which is composed of up to thousands of HTML files, things tend to go wrong.
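One defensive way around a per-index document ceiling is to rotate to a fresh shard before any single index gets near the cap. Here is a minimal sketch of that idea in plain Python; the cap value is illustrative (far below Lucene's real 2**31 - 1 limit), and the function is a stand-in for whatever the indexing pipeline actually does:

```python
# Rotate documents into shards so no single index approaches a document cap.
# MAX_DOCS_PER_SHARD is a hypothetical value chosen for illustration;
# Lucene's actual ceiling is 2**31 - 1, but staying far below it keeps
# merges and queries manageable.

MAX_DOCS_PER_SHARD = 1_000_000

def shard_documents(docs, max_docs=MAX_DOCS_PER_SHARD):
    """Group an iterable of documents into shards of at most max_docs each."""
    shard, shards = [], []
    for doc in docs:
        shard.append(doc)
        if len(shard) == max_docs:
            shards.append(shard)
            shard = []
    if shard:  # flush the final, possibly partial, shard
        shards.append(shard)
    return shards
```

Each returned shard would then be fed to its own index writer, which is essentially what plan B below amounts to.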

See the details below.

So the problem is: we are stuck, and the deadline for the project is in three days! The solution: switch to plan B, go back to the smaller indexes we had before merging them, and hope for the best. Smaller indexes, but many of them!

That leaves us with more than 12,000 index files that need to be queried for 170 queries (potentially 170*10, but due to the shortage of time we settle for less and try to get ready as soon as possible).
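Querying thousands of small indexes instead of one big one means fanning each query out to every shard and merging the per-shard hit lists into one global ranking. A minimal sketch of the merge step, assuming each shard returns (score, doc_id) pairs; the shard contents here are made up, and in practice each shard would be a Lucene index with its own searcher:

```python
import heapq

def merge_shard_hits(per_shard_hits, k=10):
    """Merge (score, doc_id) hit lists from many index shards into one
    global top-k list. Higher score means a better match."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

# Example: three tiny shards answering the same query.
shard_a = [(0.91, "doc-17"), (0.40, "doc-3")]
shard_b = [(0.75, "doc-88")]
shard_c = [(0.95, "doc-5"), (0.10, "doc-41")]
top = merge_shard_hits([shard_a, shard_b, shard_c], k=3)
```

This only works as a global ranking if scores are comparable across shards, which is a real caveat with per-index statistics like IDF; with little time left, accepting that approximation is part of the bet.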

Here are the details of the files in the corpus that we are processing: each StreamItem is an HTML file, NLP-annotated using LingPipe and stored as a Thrift object. Each chunk file is an xz-compressed, GnuPG-encrypted file: .xz.gpg

StreamItems    chunk files   Substream
126,952        11,851        arxiv (full text, abstracts in StreamItem.other_content)
394,381,405    688,974       social (reprocessed from kba-stream-corpus-2012 plus extension, same stream_id)
134,933,117    280,658       news (reprocessed from kba-stream-corpus-2012, same stream_id)
5,448,875      12,946        linking (reprocessed from kba-stream-corpus-2012, same stream_id)
396,863,627    927,257       WEBLOG (spinn3r)
57,391,714     164,160       MAINSTREAM_NEWS (spinn3r)
36,559,578     85,769        FORUM (spinn3r)
14,755,278     36,272        CLASSIFIED (spinn3r)
52,412         9,499         REVIEW (spinn3r)
7,637          5,168         MEMETRACKER (spinn3r)
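Reading one of these chunk files means peeling two layers before the Thrift records appear: GnuPG decryption first, then xz decompression. The gpg step needs the corpus key and an external gpg binary, so the sketch below only demonstrates the xz layer with Python's stdlib lzma module; the file name in the comment is illustrative:

```python
import lzma

def decompress_chunk(xz_bytes):
    """Undo the .xz layer of a chunk file. The result is a stream of
    Thrift-serialized StreamItem records; actually parsing those would
    need the corpus's Thrift definitions, which are not shown here."""
    return lzma.decompress(xz_bytes)

# In the real pipeline the input bytes would come from gpg, roughly:
#   gpg --decrypt chunk.sc.xz.gpg > chunk.sc.xz   (external step, key required)
```

Doing this two million times is itself a sizable batch job, which is part of why the index count ballooned in the first place.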
