Million Query Track http://ir.cis.udel.edu/million/
The Million Query Track serves two purposes. First, it is an exploration of ad-hoc retrieval on a large collection of documents. Second, it investigates questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments, and whether small sets of judgments are reusable. Participants in this track will run up to 40,000 queries against a large collection of web documents at least once. These queries will be classified by assessors as "precision-oriented" or "recall-oriented". Participants can, if so motivated, try to determine the query class and choose ranking algorithms specialized for each class.
The task is a standard ad hoc retrieval task, with the added feature that queries will be classified by hardness and by whether the user's intent is to find as much information about the topic as possible ("recall-oriented"), or to find one or a few highly-relevant documents about one particular aspect ("precision-oriented").
Here are the query types:
- Navigational (precision-oriented): Find a specific URL or web page.
- Closed, directed information need (precision-oriented): Find a short, unambiguous answer to a specific question.
- Resource (precision-oriented): Locate a web-based resource or download.
- Open or undirected information need (recall-oriented): Answer an open-ended question, or find all available information about a topic.
- Advice (recall-oriented): Find advice or ideas regarding a general question or problem.
- List (recall-oriented): Find a list of results that will help satisfy an open-ended goal.
The track is called "Million Query" because it explores, for each search query, whether the user is seeking depth in a topic (precision) or breadth across the topic (recall). Participants can estimate whether a query is easy or hard and select their ranking algorithm accordingly. The track also supported prioritization, letting participants decide which queries to resolve, since not everybody had enough computational resources to index the whole corpus. Queries were sampled from query logs, anonymized, and processed to make sure they were high-volume.
Here is a good overview of the track: http://trec.nist.gov/pubs/trec18/papers/MQ09OVERVIEW.pdf
----
They crawl the web using Heritrix, with around 2,000,000 seed URLs generated from previous ClueWeb tracks or added manually (based on country, type, ...), and a blacklist of spam and pornography websites: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
The duplication (16%) was due to a mistake they made in crawling; the duplicates have supposedly been removed. http://www.lemurproject.org/clueweb12.php/specs.php There are two categories: CatA, the entire corpus (1 billion pages), and CatB, the English subset (50 million English pages).
Queries and Annotations
Part of the dataset is the Freebase-annotated queries. They thanked Google for creating the Freebase annotations for the dataset, but didn't go into detail about how they were created. There are two categories of queries:
http://lemurproject.org/clueweb12/TREC%20Freebase%20Queries,%20v1.0.zip
TREC Web Topics (ClueWeb):
Since the number of annotations in these topics is fairly small, they have been reviewed by human annotators, and the errors found were manually corrected. The number of remaining errors is estimated to be under 1%. http://trec.nist.gov/data/web/09/wt09.topics.full.xml
2009 Million Query Track queries: The original dataset is described here: http://trec.nist.gov/data/
More on Annotation Details:
The annotation process was automatic, and hence imperfect. However, the annotations are of generally high quality, as the creators strived for high precision (and, by necessity, lower recall). For each entity recognized with high confidence, they provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and the confidence level.
Sample input (from TREC topic 101, year 2009):
Find information on President Barack Obama's family history, including genealogy, national origins, places and dates of birth, etc.
Sample output (the fields are tab-separated):
Barack Obama	30	41	/m/02mjmr	0.99830526
Where:
- "Barack Obama" is the entity mention recognized in the input text.
- 30 and 41 are the beginning and end byte offsets of the entity mention in the input text.
- /m/02mjmr - Freebase identifier for the entity. To look up the entity in Freebase, just prepend the string "http://www.freebase.com" before the identifier, like so: "http://www.freebase.com/m/02mjmr".
- 0.99830526 - confidence score of recognizing this particular entity.
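A line in this format is straightforward to parse. Below is a minimal sketch based on the field layout described above; the `Annotation` type and `parse_annotation` helper are illustrative names, not part of the dataset distribution.

```python
# Sketch: parse one tab-separated Freebase annotation line
# (mention, begin offset, end offset, mid, confidence).
from typing import NamedTuple

class Annotation(NamedTuple):
    mention: str      # entity mention recognized in the input text
    begin: int        # beginning byte offset in the input text
    end: int          # end byte offset in the input text
    mid: str          # Freebase identifier, e.g. /m/02mjmr
    confidence: float # confidence score of the recognition

def parse_annotation(line: str) -> Annotation:
    mention, begin, end, mid, conf = line.rstrip("\n").split("\t")
    return Annotation(mention, int(begin), int(end), mid, float(conf))

ann = parse_annotation("Barack Obama\t30\t41\t/m/02mjmr\t0.99830526")
print(ann.mid)                              # /m/02mjmr
print("http://www.freebase.com" + ann.mid)  # full Freebase lookup URL
```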
Derived data from ClueWeb09:
WebGraph
Information on the web graph of nodes and outlinks for the dataset. The statistics for the web graph are as follows:
- Full Dataset:
- Unique URLs: 4,780,950,903 (325 GB uncompressed, 116 GB compressed)
- Total Outlinks: 7,944,351,835 (71 GB uncompressed, 29 GB compressed)
- TREC Category B (first 50 million English pages)
- Unique URLs: 428,136,613 (30 GB uncompressed, 9.7 GB compressed)
- Total Outlinks: 454,075,638 (3.8 GB uncompressed, 1.6 GB compressed)
Redirects
The redirects file is a plain text file with one redirect per line in the form of:
[Source URL] [Redirected URL] [Source IP] [Redirected IP]
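Reading that file reduces to splitting each line into the four fields shown above. A minimal sketch, assuming the fields are whitespace-separated (the helper name and the example URLs/IPs are hypothetical):

```python
# Sketch: parse one line of the redirects file,
# [Source URL] [Redirected URL] [Source IP] [Redirected IP]
def parse_redirect(line: str) -> dict:
    src_url, dst_url, src_ip, dst_ip = line.split()
    return {"src_url": src_url, "dst_url": dst_url,
            "src_ip": src_ip, "dst_ip": dst_ip}

r = parse_redirect("http://a.example/ http://b.example/ 10.0.0.1 10.0.0.2")
print(r["dst_url"])  # http://b.example/
```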
PageRank Scores
The basic PageRank algorithm with random start probability 0.15. The WebGraphs are as provided with the collection. WebGraphs not only include in-collection pages as nodes, but also all the outlinks from those pages. For example, the Category A English portion has about 500 million (503,860,525) pages, and the graph includes roughly 4.8 billion (4,780,950,903) URLs/nodes.
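The basic algorithm they describe can be sketched as a power iteration with a 0.15 random-jump probability. This toy implementation is illustrative only (a three-node graph in a Python dict); the real ClueWeb graphs have billions of nodes and need out-of-core processing.

```python
# Sketch: basic PageRank power iteration with random-jump probability 0.15.
def pagerank(outlinks: dict, jump: float = 0.15, iters: int = 50) -> dict:
    nodes = list(outlinks)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # every node starts each round with its share of the random jump
        new = {u: jump / n for u in nodes}
        # mass on dangling nodes (no outlinks) is spread uniformly
        dangling = sum(rank[u] for u in nodes if not outlinks[u])
        for u in nodes:
            if outlinks[u]:
                share = (1 - jump) * rank[u] / len(outlinks[u])
                for v in outlinks[u]:
                    new[v] += share
        for u in nodes:
            new[u] += (1 - jump) * dangling / n
        rank = new
    return rank

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy)
print(max(scores, key=scores.get))  # c (it receives the most link mass)
```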
Anchor Text Query Log http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=900
Employing the techniques described in [Dang & Croft 10], a simulated query log has been constructed from the anchor text of the ~500 million English documents in the ClueWeb09 collection. A real query log basically includes the queries submitted by users and the documents from the result pages that were clicked on to view. In this paper, they suggest that anchor text, which is readily available, can be an effective substitute for a query log, and study the effectiveness of a range of query reformulation techniques (including log-based stemming, substitution, and expansion) using standard TREC collections.