Iterative Dichotomiser 3 (ID3)

Date: May 19, 2024

Topic: ID3 Algorithm

Recall

Notes

ID3 Algorithm

LOOP:

select_best_attribute_A()
assign_A_as_decision_attribute_for_node()

for each_value_of_A:
	create_descendent_node()
	
sort_training_examples_to_leaves()

if examples_perfectly_created:
	break
else:
	iterate_over_leaves()

Entropy measures randomness in a dataset

The best attribute for each node maximizes the Entropy Gain

Selecting the best attribute (Information Gain)

The amount of information gained by picking a particular attribute
We seek to reduce randomness in the data by knowing the value of a particular attribute

Formula

Information Gain:

$\text{Gain}(S,A) = \text{Entropy}(S)-\sum_V\frac{|S_v|}{|S|}\cdot\text{Entropy}(S_V)$

$S$ is the collection of training examples
$A$ is the particular attribute
For a specific value $S_v$, we get an entropy value as well
Info Gain = Entropy(Parent) - Entropy(After Split)

Entropy

Measure of randomness.
Maximum entropy occurs when the outcome is equally distributed (e.g., flipping a coin that can be heads or tails))
Minimum entropy occurs when we know the outcome (e.g., flipping a loaded coin that is always heads)
If we have objects that are more common in one label than another, then entropy decreases
- A bag with 50 red and 50 green marbles has higher entropy
- A bag with 90 red and 10 green marbles has lower entropy

<aside> 💡 In order to choose the best attribute, we want one to maximize the Information Gain

</aside>

For a decision tree to perform well, it has to consider inductive biases in its algorithm

Inductive Bias

Assumptions or prior knowledge that an ML algorithm uses to learn and generalize from training data
Consists of restriction bias and preference bias

Restriction Bias ($H$)

Hypotheses set that we care about (e.g., for lin reg, we only consider all linear functions)
For decision trees, this includes all possible decision trees

Preference Bias ($n\in H$)

What source of hypothesis from the hypotheses set we prefer

Applying inductive bias to decision trees

Tend to produce trees with good splits at the top
Correct over incorrect splits
Prefer shorter trees to longer trees (due to good splits at top)

<aside> 📌 SUMMARY: When building a decision tree, we want to obtain the best splits for the attributes and this can be done by maximizing entropy gain

</aside>