CS 242: Information Retrieval & Web Search

Winter 2024

General Info

Instructor: Vagelis Hristidis


Lecture time: M 5:00-6:20 pm, W 5:30-6:50 pm

Location: Room A125

Office hour: Wednesday 4:30-5:30 pm (WCH 317)

Shihab Rashid

Office hours:
Shihab: Wednesdays 1:00-2:00 pm, WCH 363
Meem: Thursdays 1:00-2:00 pm, WCH 363

Reader: Pooja Patil


25% quizzes (the worst 2 quizzes will be discarded; MSOL students will have until the weekend to take each quiz, while all other students will take it during lecture time)

25% midterm

15% assignment

35% project
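
As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes every component is recorded as a percentage (0-100) and that the remaining quizzes are averaged equally after the two lowest are dropped; the function name `course_score` and the sample scores are hypothetical and not part of the official grading procedure.

```python
def course_score(quizzes, midterm, assignment, project, dropped=2):
    """Weighted course score in percent; every input is a percentage (0-100)."""
    if not quizzes:
        raise ValueError("at least one quiz score is required")
    ranked = sorted(quizzes)
    # Discard the lowest `dropped` quizzes; keep the best one if too few remain.
    kept = ranked[dropped:] or ranked[-1:]
    quiz_avg = sum(kept) / len(kept)  # assumption: equal weight per kept quiz
    # Weights from the syllabus: 25% quizzes, 25% midterm, 15% assignment, 35% project.
    return (25 * quiz_avg + 25 * midterm + 15 * assignment + 35 * project) / 100


# Hypothetical student: the quiz scores 0 and 40 are discarded, so the quiz average is 90.
print(course_score([90, 40, 80, 0, 100], midterm=80, assignment=90, project=85))  # 85.75
```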

Course Description

Information Retrieval (IR) principles, including indexing and searching document collections, Web search, and advanced topics such as deep learning and search in social networks.

Some of the topics which will be tentatively presented are:





Late submissions received before the assignment or project has been graded will receive a 20% score reduction.


Tentative Lecture Schedule



Book Chapters

Supplemental material for further reading

Jan 8

Class Overview, Overview of Information Retrieval and Search Engines

Ch. 1, 2

slides Ch. 1 and 2

Jan 10, 17

Ranking: Vector space model, Probabilistic Model, Language model

Ch. 7.1, 7.2, 7.3 (except 7.3.2)
slides Ch. 7


 Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. Adapting Ranking SVM to Document Retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR'06), pages 186-193, 2006. (pdf)
Jan 22

Hands-on Scrapy and Lucene (by TA)

Slides

Jan 24, 29

Crawling, Storing

Ch. 3, slides Ch. 3

(p1) Heydon, A. and Najork, M. 1999. Mercator: A scalable, extensible Web crawler. World Wide Web 2, 4 (Apr. 1999), 219-229. (slides)

Jan 31, Feb 5

Indexing, MapReduce, Query Processing

Ch. 5 (except 5.4.2-5.4.7, 5.7.4-5.7.5), slides Ch. 5

(p2) R. Fagin, Amnon Lotem and Moni Naor. Optimal aggregation algorithms for middleware. J. Computer and System Sciences 66 (2003), pp. 614-656. Extended abstract appeared in Proc. 2001 ACM Symposium on Principles of Database Systems (PODS '01), pp. 102-113.

(p6) Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. OSDI 2004.

Feb 7, 12

Link Analysis

Ch. 4.5

slides: link-based search

(p4) L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. 1999.

(p5) J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (1999).

Feb 14

Hands-on BERT with PyTorch and Faiss (by TA)

slides

https://github.com/facebookresearch/faiss/wiki/Getting-started

Feb 21


Ch. 8, slides Ch. 8-short

(p3) R. Fagin, Ravi Kumar and D. Sivakumar. Comparing top-k lists. SIAM J. Discrete Mathematics 17, 1 (2003).
Feb 26

Review session


Feb 28

MIDTERM

Mar 4, 6

Deep learning and IR

Deep Learning in IR

Lin, Jimmy, Rodrigo Nogueira, and Andrew Yates. "Pretrained transformers for text ranking: BERT and beyond." Synthesis Lectures on Human Language Technologies 14, no. 4 (2021): 1-325.

Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Mar 11

No class

Mar 13

Project Presentations, 5:00-7:00 pm at SSC 229



Interesting topics that there is no time to present in class

AdWords online advertising (ASU slides)

Relational DB and XML Search

1.    IR and DB

(p13) Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv: XSEarch: A Semantic Search Engine for XML. VLDB 2004, pp. 45-56

(p14) L. Guo, F. Shao, C. Botev, J. Shanmugasundaram: XRANK: Ranked Keyword Search over XML Documents. SIGMOD 2003

Web Search: Spam, topic-specific pagerank

1.    text classification

2.    Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web (WWW '06)

3.    Taher H. Haveliwala, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search," IEEE Transactions on Knowledge and Data Engineering, vol. 15,  no. 4,  pp. 784-796,  Jul/Aug,  2003.

Text Processing, Query Refinement, Results Presentation (snippets), Social Search, Question Answering,
web search advertising (if time)
Ch. 4.1, 4.2, 4.3, slides Ch. 4,
Ch. 6.1, 6.2, 6.3, slides Ch. 6

Ch. 10, slides Ch. 10

G. Salton, C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 1990.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013.

Other Resources

writing tips

presentation tips



Free download at https://ciir.cs.umass.edu/downloads/SEIRiP.pdf

Search Engines: Information Retrieval in Practice

Bruce Croft, Donald Metzler, Trevor Strohman

Addison Wesley; 1st edition

ISBN-10: 0136072240

ISBN-13: 978-0136072249



Also recommended for reference:



Academic Integrity: https://conduct.ucr.edu/