Check out the links below:
http://stackoverflow.com/questions/10247309/solr-exception-docid-must-be-0-and-maxdoc-20
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/4.1.0/org/apache/lucene/index/BaseCompositeReader.java
So the problem is that we are stuck, and the deadline for the project is in three days! The solution: switch to plan B. We go back to the smaller indexes that we had before merging them and hope for the best. Each index is smaller, but there are many of them!
We will have more than 12,000 index files that need to be queried for 170 queries (potentially 170*10, but because time is short we are settling for fewer so that we can be ready as soon as possible).
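As a rough illustration of what plan B implies, here is a minimal sketch of running one query against many small indexes and keeping only the best hits overall. It is not the code we actually ran: `search_all`, `search_one`, and the `(score, doc_id)` hit format are hypothetical names made up for this example, and `search_one` stands in for whatever function really searches a single index.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

# Hypothetical hit format for this sketch: (score, document id).
Hit = Tuple[float, str]


def search_all(
    indexes: Iterable[str],
    search_one: Callable[[str, str, int], List[Hit]],
    query: str,
    k: int = 10,
) -> List[Hit]:
    """Run one query against every index and keep the k best hits overall."""
    # Lazily chain the per-index results so that we never hold the hits
    # from all indexes in memory at the same time.
    all_hits = (
        hit
        for index in indexes
        for hit in search_one(index, query, k)
    )
    # nlargest returns the k highest-scoring hits, best first.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

With more than 12,000 indexes and 170 queries this comes to over two million individual index searches (12,000 * 170 = 2,040,000). One caveat of the approach: scores coming from separate indexes are computed from each index's own term statistics, so they are only approximately comparable when merged like this.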
---------------
These are the details of the files in the corpus that we are processing. Each StreamItem is an HTML file, NLP-annotated using LingPipe and stored as a Thrift object. Each chunk file is XZ-compressed and then encrypted with GnuPG (extension .xz.gpg):
StreamItems | chunk files | Substream |
---|---|---|
126,952 | 11,851 | arxiv (full text, abstracts in StreamItem.other_content) |
394,381,405 | 688,974 | social (reprocessed from kba-stream-corpus-2012 plus extension, same stream_id) |
134,933,117 | 280,658 | news (reprocessed from kba-stream-corpus-2012, same stream_id) |
5,448,875 | 12,946 | linking (reprocessed from kba-stream-corpus-2012, same stream_id) |
396,863,627 | 927,257 | WEBLOG (spinn3r) |
57,391,714 | 164,160 | MAINSTREAM_NEWS (spinn3r) |
36,559,578 | 85,769 | FORUM (spinn3r) |
14,755,278 | 36,272 | CLASSIFIED (spinn3r) |
52,412 | 9,499 | REVIEW (spinn3r) |
7,637 | 5,168 | MEMETRACKER (spinn3r) |
1,040,520,595 | 2,222,554 | Total |
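Since every chunk file is a .xz.gpg file, reading one means undoing the two layers in reverse order: decrypt with GnuPG first, then decompress the XZ stream. The following is only a sketch under assumptions: the function names are made up for this example, it assumes the `gpg` command-line tool is installed and the corpus key is already in the local keyring, and it stops at the raw Thrift bytes without parsing them.

```python
import lzma
import subprocess


def decrypt_gpg(path: str) -> bytes:
    """Decrypt a .gpg file by calling the gpg command-line tool.

    Assumes gpg is on the PATH and the decryption key is in the keyring.
    """
    result = subprocess.run(
        ["gpg", "--decrypt", path],
        check=True,                # raise if gpg exits with an error
        stdout=subprocess.PIPE,    # decrypted bytes are written to stdout
    )
    return result.stdout


def decompress_xz(data: bytes) -> bytes:
    """Decompress an XZ stream held in memory."""
    return lzma.decompress(data)


def read_chunk(path: str) -> bytes:
    """Return the raw serialized Thrift bytes stored in one chunk file."""
    return decompress_xz(decrypt_gpg(path))
```

Note that this holds a whole decrypted chunk in memory at once, which is acceptable for a single chunk but would need to be streamed or parallelized carefully across 2,222,554 chunk files.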