Decision trees consist of nodes that each branch either left or right depending on the split value (splitval) of that node.

This split is learned from the data: the factor whose split gives the largest reduction in Gini impurity is used to split the tree first.
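As a minimal sketch of how a candidate split could be scored in a classification setting, the block below computes Gini impurity before and after a split. The split that reduces impurity the most (equivalently, the one with the highest Gini gain) would be chosen first. The function names `gini_impurity` and `split_impurity`, the toy data, and the rule that ties go to the left child are all assumptions made for illustration.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a 1-D array of class labels: 1 - sum(p_k ** 2)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))

def split_impurity(feature, labels, splitval):
    """Size-weighted Gini impurity of the two children after a split."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    # Assumption: values equal to splitval go to the left child.
    left = labels[feature <= splitval]
    right = labels[feature > splitval]
    n = labels.size
    return (left.size / n) * gini_impurity(left) + (right.size / n) * gini_impurity(right)

# Toy data, made up for illustration: one factor (size) and the label.
size = np.array([0.2, 0.3, 4.5, 5.0])
animal = np.array(["fish", "fish", "giraffe", "giraffe"])

print(gini_impurity(animal))              # 0.5 (two equally likely classes)
print(split_impurity(size, animal, 1.0))  # 0.0 (both children are pure)
```

Here the split on size at 1.0 removes all impurity, so it would be preferred over any split that leaves mixed children.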

Decision trees in Python can be represented as matrices, with each row being a node used to split the tree further down.

This can be done with a NumPy array.

Decision Trees

In the context of animals

Factors: What we use to make each decision (size, fur, etc)
Labels: What the prediction is (giraffe, fish, etc)
Nodes: Each node holds the factor used, the split value, and the links to its left and right child nodes
Root: First node in tree
Leaves: Final nodes at the bottom of tree

[Figure: regression tree]

The above shows a regression tree, with up to 11 factors
At each node, if the value of the node's factor is less than the split value, we move to the left child
If the value is greater, we move to the right child
It is also possible for two nodes to use the same factor, or for a factor not to appear in any node

If we do in-sample testing on a fully grown tree (one training sample per leaf), we get back exactly $y$-train as $y$-predict
With out-of-sample testing, the predictions and the test values may differ

Build decision trees using a matrix or array method
For the above tree, we can construct a matrix as such (tabular view):

Using absolute indexing

Using relative indexing
For the right node, the reference can be either absolute or relative (e.g., for node 1, the right child is at node 5 in absolute terms, or at offset 4 in relative terms, since 1+4=5)
If it is a leaf node, the factor is set to -1 and the left and right links are also set to -1 (terminal node)
The left and right column values are references into the rows of the matrix
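The matrix layout described above can be sketched with a NumPy array as follows. This is an illustrative example, not the tree from the figure: the column order (factor, splitval, left, right), the toy values, the helper names `to_relative` and `query`, and the convention that a leaf stores its predicted value in the splitval column are all assumptions.

```python
import numpy as np

LEAF = -1  # marker used for factor, left and right at a terminal node

# Columns: factor index, splitval, left child row, right child row (absolute).
# Assumption: at a leaf, the splitval column stores the predicted value.
tree_abs = np.array([
    [0,    5.0,  1,    4],     # node 0 (root): split on factor 0 at 5.0
    [1,    2.0,  2,    3],     # node 1: split on factor 1 at 2.0
    [LEAF, 10.0, LEAF, LEAF],  # node 2: leaf
    [LEAF, 20.0, LEAF, LEAF],  # node 3: leaf
    [LEAF, 30.0, LEAF, LEAF],  # node 4: leaf
])

def to_relative(tree):
    """Convert absolute child rows into offsets from the current row."""
    rel = tree.copy()
    rows = np.arange(len(tree))
    internal = tree[:, 0] != LEAF
    rel[internal, 2] = tree[internal, 2] - rows[internal]
    rel[internal, 3] = tree[internal, 3] - rows[internal]
    return rel

def query(tree, x, relative=False):
    """Walk from the root to a leaf and return the value stored there."""
    row = 0
    while int(tree[row, 0]) != LEAF:
        factor, splitval = int(tree[row, 0]), tree[row, 1]
        # Assumption: ties (value == splitval) go to the left child.
        child = tree[row, 2] if x[factor] <= splitval else tree[row, 3]
        row = row + int(child) if relative else int(child)
    return tree[row, 1]

tree_rel = to_relative(tree_abs)
point = np.array([3.0, 7.0])  # factor 0 = 3.0, factor 1 = 7.0

print(query(tree_abs, point))                 # 20.0
print(query(tree_rel, point, relative=True))  # 20.0
```

Both indexing schemes reach the same leaf. Relative indexing has the practical advantage that a subtree can be moved to a different position in the array without rewriting its links.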

Querying takes $O(\log n)$ time for a balanced tree, since there is a binary split at every node (e.g., for 1000 elements, the depth is about 10, as $\log_2 1000 \approx 10$)
However, the learning process is slow by comparison
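The query depth quoted above can be checked with a couple of lines. The helper name `balanced_depth` is hypothetical, and the calculation assumes a perfectly balanced binary tree.

```python
import math

def balanced_depth(n_items):
    """Binary splits needed to isolate each of n_items in a balanced tree."""
    return math.ceil(math.log2(n_items))

print(balanced_depth(1000))  # 10, since 2 ** 10 = 1024 >= 1000
```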

Best-feature split:
Determine the best feature to split on
Use the median value of that feature as the splitval

Random split:
Randomly select a feature to split on
Randomly select 2 rows and use the mean of their values for that feature as the splitval
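The two ways of choosing a split can be sketched in one recursive builder. This is a sketch under stated assumptions, not a definitive implementation: the notes do not say how the "best" feature is measured, so the code assumes highest absolute correlation with the label; it also assumes relative indexing, that a leaf stores the mean of its labels in the splitval column, and that ties go left. All function names and the toy data are hypothetical.

```python
import numpy as np

LEAF = -1

def make_leaf(y):
    # Assumption: a leaf stores the mean label in the splitval column.
    return np.array([[LEAF, np.mean(y), LEAF, LEAF]])

def choose_split(X, y, random_split, rng):
    """Return (feature index, splitval) using one of the two strategies."""
    if random_split:
        feat = int(rng.integers(X.shape[1]))
        r1, r2 = rng.choice(X.shape[0], size=2, replace=False)
        return feat, (X[r1, feat] + X[r2, feat]) / 2.0
    # Assumption: "best" means highest absolute correlation with y.
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) if np.std(X[:, j]) > 0 else 0.0
             for j in range(X.shape[1])]
    feat = int(np.argmax(corrs))
    return feat, float(np.median(X[:, feat]))

def build_tree(X, y, leaf_size=1, random_split=False, rng=None):
    """Build a tree as rows of [factor, splitval, left, right] (relative links)."""
    rng = np.random.default_rng(0) if rng is None else rng
    if X.shape[0] <= leaf_size or np.all(y == y[0]):
        return make_leaf(y)
    feat, splitval = choose_split(X, y, random_split, rng)
    go_left = X[:, feat] <= splitval  # ties go left (assumption)
    if go_left.all() or not go_left.any():
        return make_leaf(y)  # the split separates nothing, so stop here
    left = build_tree(X[go_left], y[go_left], leaf_size, random_split, rng)
    right = build_tree(X[~go_left], y[~go_left], leaf_size, random_split, rng)
    # Left subtree starts on the next row; right subtree follows the left one.
    root = np.array([[feat, splitval, 1, left.shape[0] + 1]])
    return np.vstack([root, left, right])

def query(tree, x):
    """Follow relative links from the root to a leaf."""
    row = 0
    while int(tree[row, 0]) != LEAF:
        feat = int(tree[row, 0])
        row += int(tree[row, 2] if x[feat] <= tree[row, 1] else tree[row, 3])
    return tree[row, 1]

# Toy regression data, made up for illustration.
X = np.array([[1.0, 10.0], [2.0, 8.0], [3.0, 9.0],
              [4.0, 7.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])

best_tree = build_tree(X, y)                     # best feature + median
rand_tree = build_tree(X, y, random_split=True)  # random feature + mean of 2 rows

in_sample = np.array([query(best_tree, x) for x in X])
print(np.allclose(in_sample, y))  # True: each leaf holds one training sample
```

On this toy data the best-feature tree ends with one sample per leaf, so in-sample queries reproduce the training labels. The random variant skips the correlation and median computations at every node, which is what makes it cheaper to build; a single random tree is usually less accurate.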
