Date: October 11, 2025
Topic: Classical Information Retrieval
Recall
Information retrieval is the task of obtaining relevant documents for an information need from text sources.
We present a query to an engine to retrieve the documents.
Notes
Classical Information Retrieval
- Searching for relevant information to satisfy an information need (usually from a large collection of texts like the Web)
- Most defining task of the information age
Retrieval Paradigms
- Boolean retrieval
- Ranked retrieval
- Vector-space models
- Probabilistic IR
Task

- Given a fixed set of documents, retrieve the ones that are relevant to the user’s information need
- Example user info need: “Find out when my NAACL paper is due”
- Example query: “naacl dates” — the system only sees this
A document is relevant if it satisfies the user’s original info need. However, relevance can be ambiguous because we usually have only the query, not the original need.
Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of all relevant documents in the entire dataset that appear in the retrieved set.
Boolean Retrieval
- Given a query $q$, a document $d$ is relevant if it satisfies the user’s information need; otherwise it is not relevant
- Relevance is defined with respect to the original info need and only indirectly relates to the query
- E.g., a document with “naacl” and “dates” can be irrelevant if the actual paper deadlines are not present
- Relevance is ambiguous as we don’t have direct access to the original need — what date? which year?
Precision and Recall

Result set contains 5 documents (read from left)
- For a set of result documents $R=\{d_1, d_2, \dots, d_n\}$:
- Precision: Of all docs in $R$, what fraction is relevant
- Out of the 5 retrieved documents, 2 are relevant, so precision is $2/5 = 0.4$
- Recall: Of all relevant documents, what fraction appears in $R$
- Out of the 4 relevant documents in the entire dataset, the search returned 2, so recall is $2/4 = 0.5$
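The numbers in this example can be checked with a short Python sketch. The document IDs and the `precision_recall` helper are hypothetical, chosen only so that 5 documents are retrieved, 4 are relevant in the whole collection, and 2 appear in both sets.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and the full relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: 5 retrieved, 4 relevant overall, 2 in common.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d4", "d8", "d9"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.4 0.5
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two measures are reported together.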
We retrieve texts using a term-document incidence matrix, in which each column corresponds to a document and each row to a term.
A Boolean query can then be answered by applying bitwise operations to the rows of the query terms.
Implementing Text Retrieval
- How to search for the terms (“naacl”, “dates”) in documents?
- Looking through all documents is slow
- How should we return the documents?
Term-Document Incidence Matrix

- This data structure precomputes, for every term, which documents contain it
- Each column is a document, and each row is a term/word
- Entries in the matrix are binary: if a term appears in a specific document, the corresponding entry is $1$, otherwise $0$
- To find documents that contain particular terms, we apply bitwise operations (AND, OR, NOT) to the rows of those terms
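As a minimal sketch of the idea above, each row of the matrix can be stored as a Python integer used as a bit vector, where bit $i$ is set when the term occurs in document $i$. The three toy documents and the helper names `build_incidence` and `matching_docs` are assumptions made for illustration, not part of the lecture.

```python
# Toy collection (hypothetical); keys are document IDs 0..n-1.
docs = {
    0: "naacl paper submission dates announced",
    1: "naacl workshop call for papers",
    2: "important dates for the acl conference",
}


def build_incidence(docs):
    """Map each term to an int bitmask; bit i is 1 if the term occurs in document i."""
    matrix = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            matrix[term] = matrix.get(term, 0) | (1 << doc_id)
    return matrix


def matching_docs(mask, n_docs):
    """Decode a bitmask back into a sorted list of document IDs."""
    return [i for i in range(n_docs) if mask & (1 << i)]


matrix = build_incidence(docs)
n = len(docs)
all_docs = (1 << n) - 1  # mask with one bit per document, needed for NOT

naacl = matrix.get("naacl", 0)
dates = matrix.get("dates", 0)

both = naacl & dates                          # naacl AND dates
either = naacl | dates                        # naacl OR dates
naacl_not_dates = naacl & (all_docs & ~dates)  # naacl AND NOT dates

print(matching_docs(both, n))             # [0]
print(matching_docs(either, n))           # [0, 1, 2]
print(matching_docs(naacl_not_dates, n))  # [1]
```

Each query costs a handful of bitwise operations regardless of document length, which is the point of precomputing the matrix instead of scanning every document.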
Problems
- The term-document matrix can be very large, even for moderate-size collections
<aside>
📌 SUMMARY:
Boolean retrieval uses structures like the Term-Document Incidence Matrix to decide whether a document matches by checking if it contains the required words
In practice this is implemented with an Inverted Index, which maps each term to a sorted list of the IDs of the documents that contain it
For phrase queries like “Red Hot Chili Peppers”, we can use a Positional Index, which also stores the positions at which each term appears in each document, allowing for phrase queries and proximity searches
</aside>
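The inverted and positional indexes mentioned in the summary could look roughly like the sketch below. It is illustrative only: the toy documents, whitespace tokenization, and the function names `build_positional_index` and `phrase_query` are assumptions, and real systems use more compact postings lists and merge algorithms.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; positions are in increasing order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted IDs of documents that contain the phrase's terms consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term (intersection of postings).
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    results = []
    for doc_id in sorted(candidates):
        # The phrase matches if, for some start position p, term k sits at p + k.
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                results.append(doc_id)
                break
    return results


# Toy collection (hypothetical).
docs = {
    0: "red hot chili peppers live",
    1: "hot red peppers and chili",
    2: "the red hot chili peppers album",
}
index = build_positional_index(docs)

print(sorted(index["chili"]))                        # [0, 1, 2]
print(phrase_query(index, "Red Hot Chili Peppers"))  # [0, 2]
```

Document 1 contains all four words, so a plain Boolean AND would return it, but the positional check rejects it because the words are not adjacent in the right order.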
Date: October 12, 2025
Topic: Ranked Retrieval