Date: September 15, 2025
Topic: Document Semantics
Recall
Computers don’t understand text, only vectors
Notes
Semantics
- Semantics are what words and phrases mean
- However, neural networks don’t know what words mean
- By extension, neural networks don’t know what a document is
- To a computer, words are just one-hot vectors, which carry little meaning on their own
Document Semantics
How to represent the meaning of a document?

Multi-hot representation
- Can use multi-hot vectors
- However, some words are more important than others and this importance is not captured
While a multi-hot vector has one position for each word in the vocabulary, every word that is present gets the same value. If we want to know whether 2 documents are related, we should look for the words that help distinguish documents.
Multi-hot Vectors

- Set a 1 in each position for every word in a document
- However, this results in all words getting an equal value (e.g., “star” vs. “the”)
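A minimal sketch of building multi-hot vectors. The three texts below are hypothetical stand-ins, not the lecture's actual examples, and `multi_hot` is an illustrative helper name.

```python
# Hypothetical toy corpus (not the lecture's original texts).
docs = [
    "the star wars franchise",
    "the star trek franchise",
    "the star in the sky",
]

# Vocabulary: one position per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def multi_hot(doc):
    """Return a 0/1 vector with a 1 at the position of every word in doc."""
    vec = [0] * len(vocab)
    for word in doc.split():
        vec[index[word]] = 1  # repeated words still only set a 1
    return vec

vectors = [multi_hot(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Note that “the” and “star” get a 1 in every vector, so they carry exactly as much weight as a rarer word like “sky”.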
Differentiating Words

- If we want to know whether 2 documents are related, we want to rely on words that are effective at distinguishing documents
- In the example above, the word “franchise” seems like a good candidate as it appears in only 2 of the 3 texts
- Hence, some words are significant when it comes to identifying documents
TF-IDF provides a more nuanced document representation than a pure multi-hot vector. By taking into account the frequency of words within and across documents, we can get a vector that represents the importance of each word in a document.
Term Frequency-Inverse Document Frequency
- TF-IDF: Give weight to each word based on how important it is
Term Frequency

- If a word shows up frequently in a document, it is likely more important to that document than other words
- Want to “reward” words that show up a lot in a particular document
- Taking the $\log$ of the count means that TF grows very slowly as the frequency of a word gets bigger
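A rough sketch of a log-scaled term frequency. The exact formula is not written out in these notes, so the form $\log(1 + \text{count})$ used here is an assumption (one common variant), and `term_frequency` is a hypothetical helper name.

```python
import math
from collections import Counter

def term_frequency(word, doc_tokens):
    """Assumed form: TF(w, d) = log(1 + number of times w appears in d)."""
    count = Counter(doc_tokens)[word]  # 0 if the word is absent
    return math.log(1 + count)

tokens = "the star in the sky".split()  # hypothetical document
tf_the = term_frequency("the", tokens)    # appears twice
tf_star = term_frequency("star", tokens)  # appears once

# The log keeps very frequent words from dominating:
tf_1 = term_frequency("x", ["x"] * 1)
tf_100 = term_frequency("x", ["x"] * 100)
print(tf_the, tf_star, tf_1, tf_100)
```

Under this assumed form, a word that appears 100 times scores only a few times higher than a word that appears once, rather than 100 times higher.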
Inverse Document Frequency

- Words that are common in all documents are not very important, so they should receive a low weight
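A rough sketch of inverse document frequency. The notes do not give the formula, so the form $\log(N / \text{df}(w))$ below is an assumption (the textbook variant without smoothing), and the corpus is hypothetical.

```python
import math

# Hypothetical toy corpus (not the lecture's original texts).
docs = [
    "the star wars franchise",
    "the star trek franchise",
    "the star in the sky",
]
tokenized_docs = [doc.split() for doc in docs]

def inverse_document_frequency(word, tokenized_docs):
    """Assumed form: IDF(w) = log(N / df(w)), df = number of docs containing w."""
    n_docs = len(tokenized_docs)
    df = sum(1 for tokens in tokenized_docs if word in tokens)
    if df == 0:
        return 0.0  # assumption: a word absent from the corpus gets no weight
    return math.log(n_docs / df)

idf_the = inverse_document_frequency("the", tokenized_docs)              # in 3 of 3 docs
idf_franchise = inverse_document_frequency("franchise", tokenized_docs)  # in 2 of 3 docs
idf_sky = inverse_document_frequency("sky", tokenized_docs)              # in 1 of 3 docs
print(idf_the, idf_franchise, idf_sky)
```

A word that appears in every document gets an IDF of 0 under this form, while rarer words get progressively larger values.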
Using TF-IDF

- Can apply TF-IDF to every word $w$ in every document $d$ to get a vector
- The magnitude of each entry in the vector thus shows how important the corresponding word is to that document
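Putting the two pieces together, a TF-IDF document vector can be sketched as below. The TF and IDF forms ($\log(1 + \text{count})$ and $\log(N / \text{df})$) are assumptions, since the notes do not spell them out; the corpus and the `tfidf_vector` name are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the lecture's original texts).
docs = [
    "the star wars franchise",
    "the star trek franchise",
    "the star in the sky",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

def tfidf_vector(tokens, tokenized_docs, vocab):
    """One TF-IDF score per vocabulary word for a single document."""
    counts = Counter(tokens)
    n_docs = len(tokenized_docs)
    vec = []
    for word in vocab:
        tf = math.log(1 + counts[word])                     # assumed TF form
        df = sum(1 for t in tokenized_docs if word in t)    # document frequency
        idf = math.log(n_docs / df) if df else 0.0          # assumed IDF form
        vec.append(tf * idf)
    return vec

doc_vectors = [tfidf_vector(t, tokenized, vocab) for t in tokenized]
for word, score in zip(vocab, doc_vectors[2]):
    print(f"{word:10s} {score:.3f}")
```

In this toy corpus, “the” and “star” appear in every document and so score 0 everywhere, while “sky” and “in” get the largest entries in the third document's vector.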
<aside>
📌 SUMMARY:
Representing documents may be done through multi-hot vectors or through TF-IDF vectors, which give each word a varying score.
TF-IDF is useful as it helps to find words that are important within documents while ignoring common words that appear in all documents.
By multiplying a query vector with a stack of document vectors, we get an array of cosine scores (provided the vectors are normalized to unit length); taking the argmax gives us the document of interest.
</aside>
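The retrieval step from the summary can be sketched as follows. The weight values below are hypothetical, illustrative numbers rather than real TF-IDF output, and `normalize` and `cosine_scores` are illustrative helper names. Normalizing each vector to unit length is what turns the plain dot product into a cosine score.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_scores(query, doc_matrix):
    """Dot product of the normalized query with every normalized document row."""
    q = normalize(query)
    return [sum(a * b for a, b in zip(q, normalize(row))) for row in doc_matrix]

# Hypothetical weight vectors over a 4-word vocabulary (illustrative values only).
doc_matrix = [
    [0.3, 0.0, 0.0, 0.8],
    [0.3, 0.0, 0.8, 0.0],
    [0.0, 0.8, 0.0, 0.0],
]
query = [0.0, 0.0, 0.0, 1.0]  # query that only uses the last vocabulary word

scores = cosine_scores(query, doc_matrix)
best = max(range(len(scores)), key=scores.__getitem__)  # argmax over documents
print(scores, best)
```

Only the first document shares a weighted word with the query, so it receives the only nonzero score and is picked by the argmax.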
Date: September 15, 2025
Topic: Word Embeddings