August 24, 2013

Push the Limits of Big Data

You know you are working on big data when you hit the capacity of the giant text indexer Lucene. Lucene has a hard maximum on the number of documents per index: document IDs are signed 32-bit integers, so a single index tops out at 2,147,483,647 documents. When you are processing over two million compressed files, each of which is composed of up to thousands of HTML files, things tend to go wrong.
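One defensive way around a per-index document ceiling is to rotate to a fresh shard before any single index gets near the cap. Here is a minimal sketch of that idea in plain Python; the cap value is illustrative (far below Lucene's real 2**31 - 1 limit), and the function is a stand-in for whatever the indexing pipeline actually does:

```python
# Rotate documents into shards so no single index approaches a document cap.
# MAX_DOCS_PER_SHARD is a hypothetical value chosen for illustration;
# Lucene's actual ceiling is 2**31 - 1, but staying far below it keeps
# merges and queries manageable.

MAX_DOCS_PER_SHARD = 1_000_000

def shard_documents(docs, max_docs=MAX_DOCS_PER_SHARD):
    """Group an iterable of documents into shards of at most max_docs each."""
    shard, shards = [], []
    for doc in docs:
        shard.append(doc)
        if len(shard) == max_docs:
            shards.append(shard)
            shard = []
    if shard:  # flush the final, possibly partial, shard
        shards.append(shard)
    return shards
```

Each returned shard would then be fed to its own index writer, which is essentially what plan B below amounts to.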

See the details below.

So the problem is: we are stuck, and the deadline for the project is in three days! The solution: switch to plan B, go back to the smaller indexes we had before merging them, and hope for the best. Smaller indexes, but many of them!

That leaves us with more than 12,000 index files that need to be queried for 170 queries (potentially 170*10, but due to the shortage of time we settle for less and try to get ready as soon as possible).
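Querying thousands of small indexes instead of one big one means fanning each query out to every shard and merging the per-shard hit lists into one global ranking. A minimal sketch of the merge step, assuming each shard returns (score, doc_id) pairs; the shard contents here are made up, and in practice each shard would be a Lucene index with its own searcher:

```python
import heapq

def merge_shard_hits(per_shard_hits, k=10):
    """Merge (score, doc_id) hit lists from many index shards into one
    global top-k list. Higher score means a better match."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

# Example: three tiny shards answering the same query.
shard_a = [(0.91, "doc-17"), (0.40, "doc-3")]
shard_b = [(0.75, "doc-88")]
shard_c = [(0.95, "doc-5"), (0.10, "doc-41")]
top = merge_shard_hits([shard_a, shard_b, shard_c], k=3)
```

This only works as a global ranking if scores are comparable across shards, which is a real caveat with per-index statistics like IDF; with little time left, accepting that approximation is part of the bet.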

Here are the details of the files in the corpus that we are processing: each StreamItem is an HTML file, NLP-annotated using LingPipe and stored as a Thrift object. Each chunk file is an xz-compressed, GnuPG-encrypted file: .xz.gpg

StreamItems    chunk files   Substream
126,952        11,851        arxiv (full text, abstracts in StreamItem.other_content)
394,381,405    688,974       social (reprocessed from kba-stream-corpus-2012 plus extension, same stream_id)
134,933,117    280,658       news (reprocessed from kba-stream-corpus-2012, same stream_id)
5,448,875      12,946        linking (reprocessed from kba-stream-corpus-2012, same stream_id)
396,863,627    927,257       WEBLOG (spinn3r)
57,391,714     164,160       MAINSTREAM_NEWS (spinn3r)
36,559,578     85,769        FORUM (spinn3r)
14,755,278     36,272        CLASSIFIED (spinn3r)
52,412         9,499         REVIEW (spinn3r)
7,637          5,168         MEMETRACKER (spinn3r)
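Reading one of these chunk files means peeling two layers before the Thrift records appear: GnuPG decryption first, then xz decompression. The gpg step needs the corpus key and an external gpg binary, so the sketch below only demonstrates the xz layer with Python's stdlib lzma module; the file name in the comment is illustrative:

```python
import lzma

def decompress_chunk(xz_bytes):
    """Undo the .xz layer of a chunk file. The result is a stream of
    Thrift-serialized StreamItem records; actually parsing those would
    need the corpus's Thrift definitions, which are not shown here."""
    return lzma.decompress(xz_bytes)

# In the real pipeline the input bytes would come from gpg, roughly:
#   gpg --decrypt chunk.sc.xz.gpg > chunk.sc.xz   (external step, key required)
```

Doing this two million times is itself a sizable batch job, which is part of why the index count ballooned in the first place.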
