UCR - CS 242 - Winter 2018
Instructions: Turn in your paper on iLearn by 2/19. This is an individual assignment.
Consider the following document D, taken from a collection C.
"The University of California, Riverside is one of 10 universities within the prestigious University of California system, and the only UC located in Inland Southern California. Widely recognized as one of the most ethnically diverse research universities in the nation."
Consider the following two queries:
Q1: university Riverside
Q2: diverse university
Characteristics of collection C are as follows:
# docs in collection C: 1000
# docs in C that contain "Riverside": 100
# docs in C that contain "university/ies": 200
# docs in C that contain "diverse": 150
Compute the scores of Q1 and Q2 for D, using (a) BM25, and (b) Unigram Language Model (with smoothing method of your choice). Make and state any assumptions necessary, e.g., about the constants in BM25.
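As a starting point for checking hand calculations, the two scoring functions can be sketched as below. This is a minimal sketch, not a reference solution, and every constant in it is an assumption that the problem leaves open: k1 = 1.2 and b = 0.75 for BM25, an average document length equal to the length of D, the common log((N - df + 0.5) / (df + 0.5)) form of IDF, Jelinek-Mercer smoothing with weight lam = 0.5 on the collection model, and a collection probability approximated as df / (N * avgdl) because the collection term frequencies are not given. The helper names (tokenize, bm25, lm_jm) are hypothetical.

```python
import math
import re
from collections import Counter

DOC = (
    "The University of California, Riverside is one of 10 universities "
    "within the prestigious University of California system, and the only "
    "UC located in Inland Southern California. Widely recognized as one of "
    "the most ethnically diverse research universities in the nation."
)
N_DOCS = 1000  # number of docs in collection C
DOC_FREQ = {"riverside": 100, "university": 200, "diverse": 150}


def tokenize(text):
    # Lowercase and fold "universities" into "university"
    # (assumption suggested by the "university/ies" statistic).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return ["university" if t == "universities" else t for t in tokens]


def bm25(query, doc_tokens, k1=1.2, b=0.75, avgdl=None):
    dl = len(doc_tokens)
    avgdl = dl if avgdl is None else avgdl  # assumption: D has average length
    tf = Counter(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        df = DOC_FREQ[term]
        idf = math.log((N_DOCS - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score


def lm_jm(query, doc_tokens, lam=0.5, avgdl=None):
    # Log query likelihood with Jelinek-Mercer smoothing.
    dl = len(doc_tokens)
    avgdl = dl if avgdl is None else avgdl
    tf = Counter(doc_tokens)
    log_prob = 0.0
    for term in tokenize(query):
        p_doc = tf[term] / dl
        # Assumption: each doc containing the term contains it once,
        # and the collection has N_DOCS * avgdl tokens in total.
        p_coll = DOC_FREQ[term] / (N_DOCS * avgdl)
        log_prob += math.log((1 - lam) * p_doc + lam * p_coll)
    return log_prob
```

Calling `bm25("university Riverside", tokenize(DOC))` and `lm_jm("diverse university", tokenize(DOC))` gives the scores under these assumptions; different constants or a different smoothing method will change the numbers, so state whichever choices you make.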
Compute the PageRank score of each node in the graph below. Show your work.
In how many iterations does the computation converge?
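The iterative computation can be sketched as follows. The graph in the example call is purely hypothetical (the assignment's own graph is not reproduced here), and the damping factor 0.85, the uniform start vector, the L1 convergence tolerance of 1e-6, and the even redistribution of rank from dangling nodes are all assumptions. The iteration count reported depends directly on the tolerance chosen.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Power iteration; links maps each node to the list of nodes it points to."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # uniform start (assumption)
    for iteration in range(1, max_iter + 1):
        # Every node receives the teleportation share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A dangling node spreads its rank evenly over all nodes.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        # L1 change between successive rank vectors.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            return rank, iteration
    return rank, max_iter


# Hypothetical 3-node graph, for illustration only.
example_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, iterations = pagerank(example_graph)
```

Replacing `example_graph` with the adjacency lists of the assigned graph gives a way to check the hand-computed scores and the number of iterations under the chosen tolerance.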
Show how MapReduce can be used to efficiently solve the following problem:
Given a collection of input documents, output all pairs of keywords that co-occur in at least 1000 of the documents.
Write pseudocode for map and reduce functions.
Full points will be given for the most efficient implementation.
Hint: is multi-phase MapReduce useful here?
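One possible two-phase approach, simulated in memory, is sketched below. It is not Hadoop code: run_mapreduce is a hypothetical helper that only mimics the map, shuffle, and reduce steps, and documents are assumed to be whitespace-separated strings of keywords. The idea is that a pair can co-occur in at least 1000 documents only if each of its keywords appears in at least 1000 documents, so phase 1 finds the frequent keywords and phase 2 emits pairs of frequent keywords only.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    # In-memory stand-in for the framework: map, group by key, reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key, values in groups.items():
        results.extend(reducer(key, values))
    return results


def frequent_pairs(docs, threshold=1000):
    def map_words(doc):
        # Emit each distinct keyword once per document.
        for word in set(doc.split()):
            yield word, 1

    def reduce_count(key, values):
        # Keep only keys that occur in at least `threshold` documents.
        total = sum(values)
        if total >= threshold:
            yield key, total

    # Phase 1: document frequency of single keywords.
    frequent = {w for w, _ in run_mapreduce(docs, map_words, reduce_count)}

    def map_pairs(doc):
        # Emit only pairs whose keywords both survived phase 1;
        # sorting gives each pair one canonical key.
        kept = sorted(set(doc.split()) & frequent)
        for pair in combinations(kept, 2):
            yield pair, 1

    # Phase 2: document frequency of candidate pairs.
    return dict(run_mapreduce(docs, map_pairs, reduce_count))
```

For example, `frequent_pairs(["a b c", "a b", "a c", "b a"], threshold=3)` keeps only the pair ("a", "b"). In a real deployment the set of frequent keywords would have to be shipped to the phase-2 mappers (for instance as a side file), and a combiner could pre-sum the counts; both are design choices this sketch leaves out.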
For a specific query Q, suppose that a search engine can produce up to 3 results, where the i-th result has probability 1/(2i) of being relevant. That is, the 1st result has probability 1/2, the 2nd has 1/4, and the 3rd has 1/6. Also, assume Q has a total of 3 relevant results in the collection.
C1: What is the expected average precision (AP) if the engine outputs 2 results?
C2: How many results should the search engine output to maximize the expected AP? Show your calculations and results.
C3: How many results should the search engine output to maximize the expected F (harmonic mean of precision and recall)? Show your calculations and results.
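One way to check hand calculations for C1 through C3 is to enumerate every relevance outcome for the top k results and weight each by its probability. The sketch below assumes that the relevance of each result is independent of the others, that AP is normalized by the 3 relevant results in the collection, and that F is 0 when no returned result is relevant; the function names are hypothetical. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

TOTAL_RELEVANT = 3  # relevant results for Q in the collection


def relevance_prob(i):
    # The i-th result is relevant with probability 1/(2i).
    return Fraction(1, 2 * i)


def outcomes(k):
    # Yield every 0/1 relevance pattern for k results with its probability
    # (assumption: results are relevant independently of each other).
    for rels in product([0, 1], repeat=k):
        prob = Fraction(1)
        for i, r in enumerate(rels, start=1):
            p = relevance_prob(i)
            prob *= p if r else 1 - p
        yield rels, prob


def average_precision(rels):
    hits = 0
    total = Fraction(0)
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += Fraction(hits, i)  # precision at rank i
    return total / TOTAL_RELEVANT


def f_measure(rels):
    hits = sum(rels)
    if hits == 0:
        return Fraction(0)
    precision = Fraction(hits, len(rels))
    recall = Fraction(hits, TOTAL_RELEVANT)
    return 2 * precision * recall / (precision + recall)


def expected(metric, k):
    # Expectation of the metric over all relevance outcomes for k results.
    return sum(prob * metric(rels) for rels, prob in outcomes(k))
```

Calling `expected(average_precision, k)` and `expected(f_measure, k)` for k = 1, 2, 3 gives the values to compare when deciding how many results to output.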