How Decision Trees Pick the Right Question to Ask: The Gini Index Explained

decision-trees gini-index information-gain classification supervised-learning

A decision tree classifies data the way a doctor works through a differential diagnosis: it asks a sequence of yes/no questions, narrowing the possibilities with each answer until it reaches a conclusion. The structure itself is just a flowchart. The interesting part is how the algorithm decides which question to ask first.

The tree is hierarchical. It starts at a root node, which asks the first question. Each question splits the data into branches, and the process recurses until it reaches leaf nodes: terminal nodes that give a final prediction. At each node, the algorithm has to pick the feature and threshold that create the most useful split.
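The root-to-leaf walk described above can be sketched in a few lines. This is a minimal illustration, not any library's implementation: the `Node` class, its field names, the `predict` helper, and the toy thresholds and labels are all hypothetical, chosen only to show the structure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Hypothetical layout: internal nodes set feature/threshold/left/right,
    # leaf nodes set only `prediction`.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[str] = None


def predict(node: Node, x: Sequence[float]) -> str:
    # Start at the root and answer one yes/no question per level
    # ("is feature <= threshold?") until a leaf is reached.
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Toy tree with made-up thresholds and labels, for illustration only.
tree = Node(
    feature=0, threshold=5.0,
    left=Node(prediction="class_a"),
    right=Node(
        feature=1, threshold=2.0,
        left=Node(prediction="class_b"),
        right=Node(prediction="class_c"),
    ),
)

print(predict(tree, [3.0, 9.0]))  # goes left at the root
print(predict(tree, [7.0, 1.0]))  # goes right, then left
```

Training a tree amounts to choosing each node's `feature` and `threshold`, which is where the split metrics come in.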

Two common metrics quantify "most useful":

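Information Gain measures how much a split reduces entropy (disorder). Entropy is high when a node's data is an even mix of classes and zero when every sample belongs to a single class. After a good split, each branch should be purer than its parent. $\text{Information Gain} = \text{entropy before} - \text{weighted entropy after}$, where each branch's entropy is weighted by the share of samples that branch receives.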
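To make the formula concrete, here is a small sketch that computes it for a list of class labels. It assumes the standard Shannon entropy with a base-2 logarithm, $-\sum_i P_i \log_2 P_i$, which the text does not spell out. The function names `entropy` and `information_gain` and the toy labels are illustrative, not taken from any library.

```python
import math
from collections import Counter


def entropy(labels):
    # Shannon entropy in bits (assumed definition): sum over classes of -p * log2(p).
    # Assumes `labels` is non-empty.
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    # Entropy before the split minus the size-weighted entropy of the branches.
    n = len(parent)
    weighted_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_after


parent = ["a"] * 4 + ["b"] * 4

# A split that separates the classes perfectly recovers all of the parent's entropy.
perfect = [["a"] * 4, ["b"] * 4]

# A split that only partly separates them gains much less.
partial = [["a", "a", "a", "b"], ["a", "b", "b", "b"]]

print(information_gain(parent, perfect))
print(information_gain(parent, partial))
```

The perfect split yields a gain of 1 bit, which is the full entropy of the 50/50 parent. The partial split yields roughly 0.19 bits.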

Gini Index measures the probability that a randomly picked element would be misclassified if it were labeled at random according to the node's class distribution:

$$\text{Gini} = 1 - \sum_i P_i^2$$

where $P_i$ is the proportion of class $i$ in the node. A Gini of 0 means the node is perfectly pure. For two classes, the maximum is 0.5, reached at a 50/50 mix. With $k$ classes, the maximum is $1 - 1/k$.
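The sketch below computes the Gini index for a node and then uses it to choose a threshold for a single numeric feature. The search strategy, which tries the midpoint between each pair of consecutive distinct values and keeps the one with the lowest size-weighted Gini, is one common approach and is assumed here rather than taken from the text. The names `gini`, `weighted_gini`, and `best_threshold` and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    # 1 minus the sum of squared class proportions. Assumes `labels` is non-empty.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    # Impurity of a split: each branch's Gini weighted by its share of the samples.
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    # Assumed search strategy: test midpoints between consecutive distinct
    # sorted values and return (threshold, weighted Gini) for the purest split.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


print(gini(["a", "a", "a", "a"]))  # pure node
print(gini(["a", "a", "b", "b"]))  # 50/50 mix of two classes
print(best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"]))
```

On this toy data, the threshold 2.5 separates the two classes completely, so its weighted Gini is 0 and it beats the other candidates. A full tree builder would repeat this search over every feature at every node.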