Date: October 11, 2025

Topic: Classical Information Retrieval

Recall

Information retrieval is the task of obtaining relevant documents for an information need from text sources.

We present a query to an engine to retrieve the documents.

Notes

Classical Information Retrieval

Searching for relevant information to satisfy an information need (usually from a large collection of texts like the Web)
One of the defining tasks of the information age

Retrieval Paradigms

Boolean retrieval
Ranked retrieval
Vector-space models
Probabilistic IR

Task

With a fixed set of documents:
- Example user info need: “Find out when my NAACL paper is due”
- Example query: “naacl dates” — the system only sees this

A document is relevant if it satisfies the user’s original info need. Relevance can be ambiguous, because the system usually sees only the query and not the original need.

Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of all relevant documents in the dataset that appear in the retrieved set.

Boolean Retrieval

A document $d$, given query $q$, is relevant if it satisfies the user’s information need, otherwise not relevant
Relevance is defined with respect to the original info need and only indirectly relates to the query
- E.g., a document with “naacl” and “dates” can be irrelevant if the actual paper deadlines are not present
- Relevance is ambiguous as we don’t have direct access to the original need — what date? which year?

Precision and Recall

Result set contains 5 documents (read from left)

For a set of result documents, $R=\{d_1, d_2, \dots, d_n\}$
Precision: Of all docs in $R$, what fraction is relevant
- Out of the 5 retrieved documents, 2 are relevant, so precision is $2/5 = 0.4$
Recall: Of all relevant documents, what fraction appears in $R$
- Out of the 4 relevant documents in the entire dataset, the search returned 2, so recall is $2/4 = 0.5$
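
The two definitions can be checked with a minimal sketch. The document IDs below are hypothetical, chosen only to match the counts in these notes (5 retrieved, 4 relevant in the dataset, 2 in common); `precision_recall` is an illustrative helper, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved doc IDs and the relevant doc IDs."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical IDs: 5 retrieved, 4 relevant overall, 2 of the relevant ones retrieved
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d4", "d7", "d9"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.4 0.5
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.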

We retrieve texts using a term-document incidence matrix, where each column is a document and each row is a term.

A Boolean query then reduces to bitwise operations on the rows of the query terms.

Implementing Text Retrieval

How to search for the terms (“naacl”, “dates”) in documents?
Looking through all documents is slow
How should we return the documents?

Term-Document Incidence Matrix

This data structure precomputes which terms occur in which documents
Each column is a document, and each row is a term/word
Entries in the matrix are binary: if a term appears in a specific document, the entry is $1$, otherwise $0$
To find documents that contain particular terms, we can apply bitwise operations (AND, OR, NOT) to the corresponding term rows
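
A minimal sketch of the incidence matrix, assuming whitespace tokenization and lowercasing (both simplifications). Each row is stored as a Python integer whose bit `i` is set when the term occurs in document `i`; the toy documents and the helper names `build_incidence` and `matching_docs` are made up for illustration.

```python
def build_incidence(docs):
    """Map each term to an int whose bit i is 1 if the term occurs in docs[i]."""
    rows = {}
    for i, text in enumerate(docs):
        for term in set(text.lower().split()):
            rows[term] = rows.get(term, 0) | (1 << i)
    return rows


def matching_docs(bits, n_docs):
    """Decode a bit vector back into a list of document indices."""
    return [i for i in range(n_docs) if (bits >> i) & 1]


# Hypothetical toy collection
docs = [
    "naacl dates announced",
    "naacl paper submission",
    "important dates for acl",
    "naacl important dates",
]
rows = build_incidence(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

naacl = rows.get("naacl", 0)
dates = rows.get("dates", 0)

both = naacl & dates                   # naacl AND dates
either = naacl | dates                 # naacl OR dates
only_naacl = naacl & ~dates & all_docs  # naacl AND NOT dates

print(matching_docs(both, len(docs)))        # [0, 3]
print(matching_docs(either, len(docs)))      # [0, 1, 2, 3]
print(matching_docs(only_naacl, len(docs)))  # [1]
```

The mask `all_docs` is needed for NOT because Python integers are unbounded, so `~dates` would otherwise set bits beyond the last document.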

Problems

The term-document matrix can be very large even for moderately sized collections, and most of its entries are $0$
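
The inverted index named in the summary avoids storing the zeros: for each term it keeps only a sorted list of the IDs of documents that contain it. The sketch below assumes whitespace tokenization and integer document IDs; the toy documents and the names `build_inverted_index` and `intersect` are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc IDs arrive in increasing order
    return dict(index)


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both (AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical toy collection
docs = [
    "naacl dates announced",
    "naacl paper submission",
    "important dates for acl",
    "naacl important dates",
]
index = build_inverted_index(docs)

print(index["naacl"])                             # [0, 1, 3]
print(index["dates"])                             # [0, 2, 3]
print(intersect(index["naacl"], index["dates"]))  # [0, 3]
```

Because both lists are sorted, the intersection takes a single pass whose cost is proportional to the combined length of the two lists, not to the size of the collection.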

<aside> 📌 SUMMARY: Boolean retrieval decides whether a document matches by checking if it contains the required words, using structures like the Term-Document Incidence Matrix. In practice this is implemented with an Inverted Index, where each document is treated as a bag of words and each term maps to a sorted list of the documents that contain it. For phrase queries like “Red Hot Chili Peppers”, we can use a Positional Index, which stores every position at which a term appears in each document, allowing for phrase queries and proximity searches.

</aside>
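
A minimal sketch of the positional index from the summary, assuming whitespace tokenization and exact consecutive matching. The toy documents and the names `build_positional_index` and `phrase_query` are hypothetical, and proximity search is not implemented here.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions at which the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return IDs of documents in which the phrase's terms appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase matches if, for some start p, term k sits at position p + k
        for p in index[terms[0]][doc_id]:
            if all((p + k) in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


# Hypothetical toy collection
docs = [
    "red hot chili peppers live in concert",
    "hot red peppers and chili recipes",
    "the red hot chili peppers released a new album",
]
pos_index = build_positional_index(docs)

print(phrase_query(pos_index, "Red Hot Chili Peppers"))  # [0, 2]
```

Document 1 contains all four words but not in consecutive order, so a plain inverted index would return it while the positional check rejects it.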