
Decision Tree

Decision Stump

Decision tree, depth <= 1

Choose a splitting attribute, then use majority voting to determine $\text{class}(t)$ for each leaf $t$

Problem: overfitting and noise sensitivity (if the chosen feature is noisy or irrelevant)

Solution: build a decision tree instead, by splitting recursively (divide and conquer)
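
For a quick feel of the difference, here is a minimal sketch (assuming scikit-learn is available; the iris dataset and the parameters are arbitrary choices) comparing a stump to a full tree:

```python
# Minimal sketch: a decision stump (depth <= 1) vs. a deeper decision tree.
# Assumes scikit-learn is installed; the dataset choice is arbitrary.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stump: one split on a single attribute, leaves labeled by majority vote.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# Full decision tree: recursively split (divide and conquer).
tree = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)

print("stump accuracy:", stump.score(X_test, y_test))
print("tree accuracy: ", tree.score(X_test, y_test))
```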

How to find a good splitting attribute?

Choose the attribute $A$ that minimizes the error rate of the resulting decision stump.

Misclassification (Error Rate)

$$\begin{aligned} \text{Misclass}_A(D) &= \sum_j \frac{|D_j|}{|D|} \left( 1 - \max_k p_{k|j} \right) \\ &= 1 - \sum_j \frac{|D_j|}{|D|} \max_k p_{k|j} \end{aligned}$$
  • $D$: The set of data points at a particular node.

  • $D_j$: The set of data points in a particular branch or leaf node $j$.

  • $|D|$: The total number of data points in the dataset $D$.

  • $|D_j|$: The number of data points in the subset $D_j$.

  • $p_{k|j}$: The probability that a data point in subset $D_j$ belongs to class $k$, i.e., the conditional probability of class $k$ given subset $D_j$.

  • $\max_k p_{k|j}$: The maximum probability among all classes $k$ for the subset $D_j$. It is the probability of the most likely class for the data points in $D_j$.

  • $\text{Misclass}_A(D)$: The misclassification rate for the dataset $D$. It measures the proportion of data points that are incorrectly classified.

(Figure: example of the misclassification rate at a split)

For the above example, the misclassification rate for the dataset $D$ is $\frac{1}{4}(1 - 1) + \frac{3}{4}\left(1 - \frac{2}{3}\right) = \frac{1}{4}$.
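
A small sketch of this computation in plain Python (the branch label lists below are hypothetical, chosen to mirror the example: one pure branch of size 1 and one branch of size 3 with a 2/3 majority):

```python
from collections import Counter

def misclass_rate(branches):
    """Weighted misclassification rate of a split.
    `branches` is a list of label lists, one per branch/leaf."""
    n = sum(len(b) for b in branches)
    rate = 0.0
    for labels in branches:
        p_majority = Counter(labels).most_common(1)[0][1] / len(labels)
        rate += len(labels) / n * (1 - p_majority)
    return rate

# Mirrors the example above: |D_1| = 1 (pure), |D_2| = 3 with majority 2/3.
print(misclass_rate([["A"], ["A", "A", "B"]]))  # -> 0.25
```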

Problem: choosing the splitting attribute that minimizes the misclassification rate may not be globally optimal (an issue with greedy algorithms, a.k.a. local optimality). It neglects the distribution of class values among the misclassified instances and only considers the majority class.

Solution: Use other metrics like Gini index, entropy, or information gain to find the best splitting attribute

Classification

Decision Tree Classification Clearly Explained! - YouTube

Entropy

Also called Info; same meaning.

$$\text{Entropy} = -\sum_i p_i \log_2(p_i) \qquad \text{Info}(D) = -\sum_i p_i \log_2(p_i)$$

If there is only one class, $-1 \cdot \log_2(1) = 0$.

For a split on attribute $A$:

$$\text{Info}_A(D) = \sum_j \frac{|D_j|}{|D|} \text{Info}(D_j)$$

Like Gini impurity, it quantifies the uncertainty or disorder in a dataset
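
A minimal sketch of $\text{Info}(D)$ and $\text{Info}_A(D)$ in plain Python (the label lists are hypothetical examples):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = sum_i p_i * log2(1 / p_i) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def info_A(branches):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j) for a split on attribute A."""
    n = sum(len(b) for b in branches)
    return sum(len(b) / n * entropy(b) for b in branches)

print(entropy(["yes", "yes", "no", "no"]))     # 1.0 bit (maximum for 2 classes)
print(entropy(["yes", "yes", "yes", "yes"]))   # 0.0 (only one class)
print(info_A([["yes", "yes"], ["no", "no"]]))  # 0.0 (both branches are pure)
```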

Gini

$$\begin{aligned} \text{Gini}(D) &= \sum_i p_i (1 - p_i) \\ &= 1 - \sum_i p_i^2 \end{aligned}$$

For a split on attribute $A$:

$$\text{Gini}_A(D) = \sum_j \frac{|D_j|}{|D|} \text{Gini}(D_j)$$

Why Does It Work?

Compared to $\max_k p_{k|j}$, which only cares about the majority class, the Gini impurity is more sensitive to changes in the class distribution. It penalizes nodes that contain a larger number of classes.
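
A sketch of $\text{Gini}(D)$ and $\text{Gini}_A(D)$ in plain Python (the label lists are hypothetical); the two leaves at the end share the same majority proportion, so the misclassification rate cannot distinguish them, but Gini can:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_A(branches):
    """Gini_A(D) = sum_j |D_j|/|D| * Gini(D_j) for a split on attribute A."""
    n = sum(len(b) for b in branches)
    return sum(len(b) / n * gini(b) for b in branches)

# Both leaves below have the same majority proportion (2/4), so the
# misclassification rate is 0.5 for both, but Gini tells them apart:
print(gini(["A", "A", "B", "B"]))  # 0.5   (two classes, even split)
print(gini(["A", "A", "B", "C"]))  # 0.625 (more classes -> higher impurity)
```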

Comparison: Entropy vs. Gini

| Aspect | Entropy | Gini Impurity |
| --- | --- | --- |
| Definition | Measures the uncertainty or disorder in the class distribution. | Measures the probability of misclassification of a random sample. |
| Formula | $\text{Entropy}(D) = -\sum_i p_i \log_2(p_i)$ | $\text{Gini}(D) = 1 - \sum_i p_i^2$ |
| Range | $0$ (perfect purity) to $\log_2(C)$ (maximum disorder for $C$ classes). | $0$ (perfect purity) to $1 - \frac{1}{C}$ (maximum impurity for $C$ classes). |
| Behavior | More sensitive to changes in class probabilities, especially near $0$ or $1$. | Less sensitive to changes in class probabilities. |
| Computational Cost | Slightly higher due to the use of logarithms. | Lower, as it involves only squares of probabilities. |
| Interpretation | Based on information theory (bits of information). | Based on probability of misclassification. |
| Use in Decision Trees | Both are commonly used; entropy is more popular in algorithms like ID3 and C4.5. | Used in CART; simpler and faster to compute, often preferred in practice. |
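
A quick numeric check of the Range and Behavior rows above, for a two-class distribution $(p, 1-p)$ (plain Python sketch):

```python
import math

def entropy_binary(p):
    """Entropy of a two-class distribution (p, 1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini_binary(p):
    """Gini impurity of a two-class distribution (p, 1 - p)."""
    return 1 - p ** 2 - (1 - p) ** 2

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  entropy={entropy_binary(p):.3f}  gini={gini_binary(p):.3f}")
# Entropy peaks at log2(2) = 1.0 and Gini at 1 - 1/2 = 0.5, both at p = 0.5.
```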

Information Gain (IG)

Also called Gain; same meaning.

$w_i$ is the weight of child $i$, i.e., the fraction of the parent's data points that fall into that child.

$$\text{Information Gain} = \text{Entropy}(\text{parent}) - \sum_i w_i \,\text{Entropy}(\text{child}_i)$$

$$\text{Gain}_A(D) = \text{Info}(D) - \text{Info}_A(D)$$
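
A sketch of $\text{Gain}_A(D)$ in plain Python (the parent/children label lists are a hypothetical perfect split):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def gain(parent, children):
    """Gain_A(D) = Info(D) - Info_A(D), with children weighted by |D_j|/|D|."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["yes", "yes", "yes", "no", "no", "no"]
children = [["yes", "yes", "yes"], ["no", "no", "no"]]  # a perfect split
print(gain(parent, children))  # 1.0: all uncertainty removed
```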

Split Info

$$\text{SplitInfo}_A(D) = \sum_j \frac{|D_j|}{|D|} \log_2\!\left(\frac{1}{|D_j|/|D|}\right)$$

Split Info is a measure used to account for the number of splits and the distribution of data points across those splits. It quantifies the potential information generated by splitting the dataset $D$ into subsets $D_j$ based on an attribute $A$.

This can also be written as:

$$\text{SplitInfo}_A(D) = -\sum_j \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)$$

$\frac{|D_j|}{|D|}$ is the proportion of data points in subset $D_j$ relative to the entire dataset $D$.

Why is Split Info Needed?

Split Info is used to normalize the Information Gain when evaluating splits in decision trees. Here's why it's important:

  1. Problem with Information Gain:

    • Information Gain (IG) measures the reduction in entropy (or uncertainty) after splitting the dataset based on an attribute $A$.
    • However, IG tends to favor attributes with many unique values (e.g., an attribute like "ID" that has a unique value for each data point). Such attributes can create many small subsets $D_j$, leading to artificially high IG even if the split is not meaningful.
  2. Bias Toward High-Cardinality Attributes:

    • Without normalization, IG might prefer attributes that split the data into many small, pure subsets, even if those subsets are not useful for generalization.
    • This can lead to overfitting, where the decision tree becomes too complex and performs poorly on unseen data.
  3. Role of Split Info:

    • Split Info penalizes attributes that create many subsets or unevenly distributed subsets.
    • It quantifies the potential information generated by the split itself, independent of the class labels.
    • By dividing the Information Gain by Split Info, we get the Information Gain Ratio, which balances the benefit of the split (IG) against the complexity of the split (Split Info).

Information Gain Ratio

$$\text{GainRatio}_A(D) = \frac{\text{Gain}_A(D)}{\text{SplitInfo}_A(D)}$$

The Information Gain Ratio (IG / Split Info) ensures that the decision tree algorithm does not favor attributes with many unique values or uneven splits, leading to more balanced, generalizable trees that are less prone to overfitting.
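
A sketch putting Gain, SplitInfo, and GainRatio together (plain Python; the tiny dataset and the ID-like split are made up to illustrate the high-cardinality bias described above):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

def split_info(parent, children):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)."""
    n = len(parent)
    return -sum(len(c) / n * math.log2(len(c) / n) for c in children)

def gain_ratio(parent, children):
    return gain(parent, children) / split_info(parent, children)

labels = ["yes", "yes", "no", "no"]

# A sensible binary split vs. an ID-like split (one subset per data point).
binary_split = [["yes", "yes"], ["no", "no"]]
id_split     = [["yes"], ["yes"], ["no"], ["no"]]

for name, split in [("binary", binary_split), ("id-like", id_split)]:
    print(name, "gain =", gain(labels, split),
          "split_info =", split_info(labels, split),
          "gain_ratio =", round(gain_ratio(labels, split), 3))
# Both splits have gain 1.0, but the ID-like split has split_info 2.0,
# so its gain ratio drops to 0.5 while the binary split keeps 1.0.
```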

Regression

Decision Tree Regression Clearly Explained! - YouTube

Variance Reduction

$$\text{Var} = \frac{1}{n} \sum_i (y_i - \bar{y})^2$$

$$\text{Variance Reduction} = \text{Var}(\text{parent}) - \sum_i w_i \,\text{Var}(\text{child}_i)$$
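
A sketch of variance reduction for a regression split (plain Python; the target values are a hypothetical example where the split cleanly separates low and high targets):

```python
def variance(values):
    """Var = (1/n) * sum (y_i - mean)^2."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values) / len(values)

def variance_reduction(parent, children):
    """Var(parent) - sum_i w_i * Var(child_i), with w_i = |child_i| / |parent|."""
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)

parent = [1.0, 2.0, 9.0, 10.0]
children = [[1.0, 2.0], [9.0, 10.0]]  # a split separating low and high targets
print(variance_reduction(parent, children))  # large reduction -> good regression split
```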