Semantics

Semantics is the meaning of words and phrases
However, neural networks don’t know what words mean
- By extension, neural networks don’t know what a document is
- To a computer, words are just one-hot vectors, which carry little meaning on their own

Document Semantics

How to represent the meaning of a document?

Multi-hot representation

Can use multi-hot vectors
However, some words are more important than others and this importance is not captured

While each vocabulary word gets its own position in the vector, every word that is present gets the same value of 1. If we want to know whether 2 documents are related, we should look for words that help distinguish documents.

Multi-hot Vectors

Set a 1 in each position for every word in a document
However, this results in all words getting an equal value (e.g., “star” vs. “the”)
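
As a rough illustration of the multi-hot idea, here is a minimal Python sketch. The toy corpus, the whitespace tokenizer, and the helper names `build_vocab` and `multi_hot` are assumptions made for this example and are not from the lecture.

```python
def build_vocab(docs):
    """Map each distinct lowercase token to a fixed vector position."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def multi_hot(doc, vocab):
    """Return a 0/1 vector with a 1 at the position of every word in doc."""
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec


# Hypothetical toy corpus (not the lecture's example).
docs = ["the star wars franchise", "the star trek franchise", "the night sky"]
vocab = build_vocab(docs)
vectors = [multi_hot(d, vocab) for d in docs]

# "the" and "star" both get the same value of 1 in the first document,
# even though "the" says nothing about what the document is about.
print(vectors[0][vocab["the"]], vectors[0][vocab["star"]])
```

Because every present word is stored as 1, the vector cannot tell a filler word from a distinctive one, which is the limitation noted above.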

Differentiating Words

If we want to know whether 2 documents are related, we want to rely on words that are effective at distinguishing documents
In the above, the word “franchise” seems like a good candidate as it appears in only 2 of the 3 texts
- Hence, some words are significant when it comes to identifying documents

TF-IDF provides a more nuanced document representation than just pure multi-hot. By taking into account the frequency of words within and across documents, we can get a vector to represent the importance of each word in a document.

Term Frequency-Inverse Document Frequency

TF-IDF: Give weight to each word based on how important it is

Term Frequency

If a word shows up frequently in a document, it is likely more important to that document than other words
Want to “reward” words that show up a lot in a particular document
Taking the $\log$ of the raw count means that TF grows very slowly as the frequency of a word gets bigger
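
The notes do not record the exact TF formula, so the sketch below assumes one common log-scaled variant, $\log(1 + \text{count})$, purely to show how slowly the score grows. The function name `term_frequency` is hypothetical.

```python
import math
from collections import Counter


def term_frequency(word, doc_tokens):
    """Log-scaled TF, log(1 + count). Assumed variant; the notes give no exact formula."""
    return math.log(1 + Counter(doc_tokens)[word])


tokens = "star star star wars the".split()
print(term_frequency("star", tokens))  # log(4): the word appears 3 times
print(term_frequency("moon", tokens))  # 0.0: the word is absent

# Raw counts of 1, 10, and 100 give scores of about 0.69, 2.40, and 4.62,
# so a 100x increase in count raises the score by less than 7x.
for count in (1, 10, 100):
    print(count, round(math.log(1 + count), 2))
```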

Inverse Document Frequency

Words that are common in all documents are not very important, since they do not help tell documents apart
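
A minimal sketch of this idea, assuming the common form $\log(N / \text{df}(w))$, where $N$ is the number of documents and $\text{df}(w)$ is the number of documents containing $w$. The exact formula is not given in the notes, and the toy corpus and the name `inverse_document_frequency` are made up for illustration.

```python
import math


def inverse_document_frequency(word, tokenized_docs):
    """IDF as log(N / df). Assumed variant; returns 0.0 for words seen in no document."""
    n_docs = len(tokenized_docs)
    df = sum(1 for doc in tokenized_docs if word in doc)
    if df == 0:
        return 0.0  # design choice for this sketch: unseen words get no weight
    return math.log(n_docs / df)


# Hypothetical toy corpus (not the lecture's example).
tokenized_docs = [
    "the star wars franchise".split(),
    "the star trek franchise".split(),
    "the night sky".split(),
]

print(inverse_document_frequency("the", tokenized_docs))        # 0.0, appears in every document
print(inverse_document_frequency("franchise", tokenized_docs))  # log(3/2)
print(inverse_document_frequency("sky", tokenized_docs))        # log(3)
```

Under this variant, a word that appears in every document scores exactly zero, and rarer words score higher.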

Using TF-IDF

Can apply TF-IDF to every word $w$ in every document $d$ to get a vector
The magnitude of each entry in the vector thus shows how important that word is to the document
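
Putting the two parts together, the sketch below builds one TF-IDF vector per document. It assumes the score is the product $\log(1 + \text{count}) \cdot \log(N / \text{df})$, which is one common choice and not necessarily the lecture's exact formula. The corpus and the name `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF weight per vocab word, per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents each word appears in.
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([math.log(1 + counts[w]) * idf[w] for w in vocab])
    return vocab, vectors


# Hypothetical toy corpus (not the lecture's example).
docs = ["the star wars franchise", "the star trek franchise", "the night sky"]
vocab, vectors = tfidf_vectors(docs)

for word, weight in zip(vocab, vectors[0]):
    print(f"{word:10s} {weight:.3f}")
```

In the first document, "the" gets a weight of zero because it appears everywhere, while "wars" outweighs "franchise" because it appears in fewer documents.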

<aside> 📌 SUMMARY: Documents can be represented with multi-hot vectors or with TF-IDF vectors, which assign a varying score to each word. TF-IDF is useful because it highlights words that are important within a document while down-weighting common words that appear in all documents. Given a query vector, we can multiply it with a stack of document vectors to get an array of similarity scores (cosine scores when all vectors are normalized to unit length), and taking the argmax gives us the document of interest.

</aside>
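
The retrieval step in the summary can be sketched as follows. The dot product only equals the cosine score when the vectors have unit length, so this sketch normalizes them first. The weights in `doc_matrix`, the query, and all function names are hypothetical values chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query, doc_matrix):
    """Dot product of the normalized query with each normalized document row."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(row))) for row in doc_matrix]


def best_match(query, doc_matrix):
    """Index of the highest-scoring document (the argmax)."""
    scores = cosine_scores(query, doc_matrix)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical TF-IDF weights over a 3-word vocabulary, one row per document.
doc_matrix = [
    [0.8, 0.3, 0.0],
    [0.0, 0.3, 0.8],
    [0.0, 0.0, 1.1],
]
query = [1.0, 0.0, 0.0]  # query that only contains the first vocabulary word

print(cosine_scores(query, doc_matrix))
print(best_match(query, doc_matrix))
```

With a numerical library, the same computation is typically a single matrix-vector product followed by an argmax.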

Date: September 15, 2025

Topic: Document Semantics

Recall

Notes