Date: September 15, 2025

Topic: Document Semantics

Recall

Computers don’t understand text, only vectors

Notes

Semantics

Document Semantics

How to represent the meaning of a document?

Multi-hot representation


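A multi-hot vector has one dimension per vocabulary word, set to 1 if the word appears in the document and 0 otherwise, so every word that is present gets an equal value. If we want to know whether 2 documents are related, we should instead give more weight to the words that help distinguish docs from one another.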
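A minimal sketch of the multi-hot idea, assuming whitespace tokenization and a tiny made-up corpus (the function names `build_vocab` and `multi_hot` are illustrative, not from the lecture):

```python
def build_vocab(docs):
    """Map each distinct word to a column index (sorted so the order is stable)."""
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}


def multi_hot(doc, vocab):
    """Return a 0/1 vector: 1 where the vocab word occurs in the doc."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec


# Toy corpus (assumed for illustration only)
docs = ["the cat sat", "the dog sat", "the stock market fell"]
vocab = build_vocab(docs)
vectors = [multi_hot(d, vocab) for d in docs]

# Overlap between two docs = number of shared words (dot product of 0/1 vectors)
overlap_01 = sum(a * b for a, b in zip(vectors[0], vectors[1]))
overlap_02 = sum(a * b for a, b in zip(vectors[0], vectors[2]))
```

Note that "the" adds to the overlap of every pair of docs just as much as a distinctive word like "cat" would, which is the weakness described above.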

Multi-hot Vectors

image.png

Differentiating Words

image.png


TF-IDF provides a more nuanced document representation than pure multi-hot. By taking into account how often a word occurs within a document and how many documents it occurs in across the collection, we get a vector whose entries represent the importance of each word to that document.
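A rough sketch of one common TF-IDF variant, assuming tf = (count of word in doc) / (doc length) and idf = log(N / number of docs containing the word). The exact formulas used in the lecture may differ (for example, smoothed or log-scaled versions), and `tf_idf` is an illustrative name:

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocab, rows) where rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    Assumes every document contains at least one token.
    """
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many docs does each word appear at least once?
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency (relative count) times inverse document frequency
        rows.append([counts[w] / len(toks) * idf[w] for w in vocab])
    return vocab, rows


# Toy corpus (assumed for illustration only)
docs = ["the cat sat", "the dog sat", "the stock market fell"]
vocab, rows = tf_idf(docs)
```

Under this variant a word that appears in every document (here "the") gets idf = log(1) = 0, so it drops out of every vector, while a word unique to one document keeps a positive weight.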

Term Frequency-Inverse Document Frequency

Term Frequency

image.png

Inverse Document Frequency

image.png

Using TF-IDF

image.png




<aside> 📌 SUMMARY: Documents can be represented by multi-hot vectors or by TF-IDF vectors, which give each word a varying score. TF-IDF is useful because it highlights words that are important within a document while down-weighting common words that appear in all documents. Given a query vector, we can multiply it with a stacked matrix of document vectors to get an array of similarity scores (cosine scores when the vectors are length-normalized); taking the argmax gives the index of the best-matching document.

</aside>
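The retrieval step from the summary can be sketched as follows. This is a plain-Python illustration under stated assumptions: the document weights and column order are hypothetical values, not real TF-IDF output, and `l2_normalize` and `cosine_scores` are illustrative names. With an array library, the same thing is usually a single matrix-vector product followed by an argmax.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query, doc_matrix):
    """Dot product of the normalized query with each normalized document row."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(row))) for row in doc_matrix]


# Hypothetical weights over the columns [cat, dog, market, stock]
doc_matrix = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.6],
]
query = [0.0, 0.0, 0.5, 0.5]  # a query about "market" and "stock"

scores = cosine_scores(query, doc_matrix)
# argmax: index of the document with the highest cosine score
best = max(range(len(scores)), key=scores.__getitem__)
```

Normalizing both sides is what makes the dot product a cosine score; without it, longer documents with larger weights would tend to win regardless of topic.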


Date: September 15, 2025

Topic: Word Embeddings