June 17, 2013

Million Query Track - Knowledgebase Acceleration

Million Query Track http://ir.cis.udel.edu/million/

The Million Query Track serves two purposes.  First, it is an exploration of ad-hoc retrieval on a large collection of documents.  Second, it investigates questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments and whether small sets of judgments are reusable. Participants in this track will run up to 40,000 queries against a large collection of web documents at least once. These queries will be classified by assessors as "precision-oriented" or "recall-oriented". Participants can, if so motivated, try to determine what the query class is and choose ranking algorithms specialized for each class.

The task is a standard ad hoc retrieval task, with the added feature that queries will be classified by hardness and by whether the user's intent is to find as much information about the topic as possible ("recall-oriented"), or to find one or a few highly-relevant documents about one particular aspect ("precision-oriented").
Here are the query types:

  • Navigational (precision-oriented): Find a specific URL or web page.
  • Closed, directed information need (precision-oriented): Find a short, unambiguous answer to a specific question.
  • Resource (precision-oriented): Locate a web-based resource or download.
  • Open or undirected information need (recall-oriented): Answer an open-ended question, or find all available information about a topic.
  • Advice (recall-oriented): Find advice or ideas regarding a general question or problem.
  • List (recall-oriented): Find a list of results that will help satisfy an open-ended goal.


The track is called Million Query because, for each search query, it tries to determine whether the query seeks depth in a topic (precision) or breadth across the topic (recall). Participants can estimate whether a query is easy or hard and select their ranking algorithm accordingly. Queries were also prioritized so that participants could decide which ones to resolve, since not everybody had enough computational resources to index the whole corpus. Queries were sampled from query logs, anonymized, and filtered to make sure they were high-volume.

Here is a good overview of the track: http://trec.nist.gov/pubs/trec18/papers/MQ09OVERVIEW.pdf

The duplication (16%) was due to a mistake made during crawling; the duplicates have supposedly been removed. http://www.lemurproject.org/clueweb12.php/specs.php There are two categories: CatA, the entire corpus (1 billion pages), and CatB, the English subset (50 million English pages).

They crawled the web using Heritrix, starting from around 2,000,000 seed URLs generated from previous ClueWeb crawls or added manually (based on country, type, etc.), with a blacklist of spam and pornography websites. https://webarchive.jira.com/wiki/display/Heritrix/Heritrix;jsessionid=3F691452AE94A7F2369B7D61C31C9430
Queries and Annotations
The dataset also includes Freebase-annotated queries. The authors thank Google for creating the Freebase annotations, but do not go into detail about how they were produced. There are two categories of queries: http://lemurproject.org/clueweb12/TREC%20Freebase%20Queries,%20v1.0.zip

TREC Web Topics (ClueWeb): Since the number of annotations in these topics is fairly small, they have been reviewed by human annotators, and the errors they found were manually corrected. We estimate the number of remaining errors to be under 1%. http://trec.nist.gov/data/web/09/wt09.topics.full.xml

2009 Million Query Track queries: The original dataset is described here: http://trec.nist.gov/data/million.query09.html and contains 40,000 queries. This dataset is much bigger than the TREC topics, so it was not possible to verify all the automatic annotations manually; based on a small sample, the error rate is believed to be under 3%. The errors identified in this small sample were not corrected, to keep the data consistent.

More on Annotation Details:
The annotation process was automatic, and hence imperfect. However, the annotations are of generally high quality, as we strived for high precision (and, by necessity, lower recall). For each entity we recognize with high confidence, we provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and the confidence level.

Sample input (from TREC topic 101, year 2009):
Find information on President Barack Obama's family history, including genealogy, national origins, places and dates of birth, etc.

Sample output (the fields are tab-separated):
Barack Obama	30	41	/m/02mjmr	0.99830526
- "Barack Obama" is the entity mention recognized in the input text.
- 30 and 41 are the beginning and end byte offsets of the entity mention in the input text.
- /m/02mjmr - Freebase identifier for the entity. To look up the entity in Freebase, just prepend the string "http://www.freebase.com" before the identifier, like so: "http://www.freebase.com/m/02mjmr".
- 0.99830526 - confidence score of recognizing this particular entity.
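The annotation format described above can be parsed with a few lines of Python. This is a minimal sketch, assuming the five fields are tab-separated exactly as in the sample; the field names in the returned dict are my own choice, not part of the dataset specification.

```python
def parse_annotation(line):
    """Parse one tab-separated Freebase annotation line into a dict.

    Fields, per the description above: entity mention, begin byte offset,
    end byte offset, Freebase mid, confidence score.
    """
    mention, begin, end, mid, confidence = line.rstrip("\n").split("\t")
    return {
        "mention": mention,
        "begin": int(begin),
        "end": int(end),
        "mid": mid,
        # Prepend the Freebase host to get a browsable URL, as noted above.
        "url": "http://www.freebase.com" + mid,
        "confidence": float(confidence),
    }

sample = "Barack Obama\t30\t41\t/m/02mjmr\t0.99830526"
ann = parse_annotation(sample)
```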
Derived data from ClueWeb09:
Information on the web graph of nodes and outlinks for the dataset. The statistics for the web graph are as follows:
  • Full Dataset:
    • Unique URLs: 4,780,950,903 (325 GB uncompressed, 116 GB compressed)
    • Total Outlinks: 7,944,351,835 (71 GB uncompressed, 29 GB compressed)
  • TREC Category B (first 50 million English pages)
    • Unique URLs: 428,136,613 (30 GB uncompressed, 9.7 GB compressed)
    • Total Outlinks: 454,075,638 (3.8 GB uncompressed, 1.6 GB compressed)
The redirects file is a plain text file with one redirect per line in the form of:
[Source URL] [Redirected URL] [Source IP] [Redirected IP]
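A record in the redirects file can be split into its four fields like so. This is a sketch assuming the fields are whitespace-separated with no embedded spaces in the URLs; the example line is hypothetical, not taken from the actual file.

```python
def parse_redirect(line):
    """Split one redirects-file line into its four fields:
    source URL, redirected URL, source IP, redirected IP."""
    source_url, target_url, source_ip, target_ip = line.split()
    return {
        "source_url": source_url,
        "target_url": target_url,
        "source_ip": source_ip,
        "target_ip": target_ip,
    }

# Hypothetical example line in the four-field layout described above:
rec = parse_redirect("http://example.com/a http://example.com/b 1.2.3.4 5.6.7.8")
```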

PageRank Scores
The basic PageRank algorithm with random start probability 0.15. The WebGraphs are as provided with the collection. WebGraphs not only include in-collection pages as nodes, but also all the outlinks from those pages. For example, the category A English portion has about 500 million (503,860,525) pages, and the graph includes roughly 4.8 billion (4,780,950,903) URLs/nodes.
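The basic algorithm with a random-jump probability of 0.15 can be sketched as a power iteration over an adjacency map. This is an illustrative toy implementation, not the code used to produce the distributed scores; note how link targets are added as nodes even when they have no outlinks of their own, mirroring how the ClueWeb graphs include out-of-collection URLs.

```python
def pagerank(graph, jump=0.15, iterations=50):
    """Power-iteration PageRank with random-jump probability `jump`
    (i.e. damping factor 1 - jump = 0.85) on a dict of node -> outlinks."""
    nodes = set(graph)
    for outs in graph.values():
        nodes.update(outs)  # include link targets that have no outlink lists
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: jump / n for u in nodes}
        dangling = 0.0
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = (1 - jump) * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                dangling += (1 - jump) * rank[u]
        for u in nodes:  # redistribute dangling-node mass uniformly
            new[u] += dangling / n
        rank = new
    return rank

scores = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On this tiny graph, "c" ends up with more mass than "b" because it is linked from both "a" and "b".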

Employing the techniques described in [Dang & Croft 10], a simulated query log has been constructed from the anchor text of the ~500 million English documents in the ClueWeb09 collection. A query log essentially records the queries submitted by users together with the result-page documents they clicked to view. In that paper, the authors suggest that anchor text, which is readily available, can be an effective substitute for a query log, and they study the effectiveness of a range of query reformulation techniques (including log-based stemming, substitution, and expansion) using standard TREC collections.
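The core idea of treating each anchor as a simulated (query, clicked document) event can be sketched with the standard-library HTML parser. This is only an illustration of the general approach, not the pipeline from the paper, and the example page is hypothetical.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (anchor text, href) pairs, treating each as a simulated
    (query, clicked document) event in the spirit of Dang & Croft."""
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:  # only collect text inside an <a> element
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            if text:
                self.pairs.append((text, self._href))
            self._href = None

p = AnchorExtractor()
p.feed('<p>See <a href="http://example.org/obama">Barack Obama family history</a>.</p>')
```

Running the extractor over a large crawl and aggregating the pairs by anchor text yields pseudo query-click frequencies.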
