Decision Tree
Decision Stump
Decision tree, depth <= 1
Choose a splitting attribute, then use majority voting within each branch to determine class(t)
Problem: Overfitting, noise sensitivity (if the feature chosen is noisy or irrelevant)
Solution: Build a full decision tree instead, by recursively splitting (divide and conquer)
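As a rough, non-authoritative sketch of the stump-vs-tree idea using scikit-learn (the iris dataset, cv=5, and random_state are arbitrary choices of mine, not from these notes):

```python
# Compare a decision stump (depth <= 1) with a fully grown tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1, random_state=0)  # single split + majority vote per branch
tree = DecisionTreeClassifier(random_state=0)                # recursive splitting until leaves are pure

print("stump accuracy:", cross_val_score(stump, X, y, cv=5).mean())
print("tree accuracy: ", cross_val_score(tree, X, y, cv=5).mean())
```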
How to find a good splitting attribute?
Choose the attribute $A$ whose split minimizes the error rate of the resulting decision stump.
Misclassification (Error Rate)
- $D$: set of data points at a particular node.
- $D_j$: set of data points in a particular branch or leaf node $j$.
- $|D|$: total number of data points in the dataset.
- $|D_j|$: number of data points in the subset $D_j$.
- $p(i \mid D_j)$: the probability that a data point in subset $D_j$ belongs to class $i$. It is the conditional probability of class $i$ given subset $D_j$.
- $\max_i p(i \mid D_j)$: the maximum probability among all classes for the subset $D_j$. It indicates the probability of the most likely class for the data points in $D_j$.
- $\text{Error}(D) = 1 - \max_i p(i \mid D)$: the misclassification rate for the dataset $D$. It measures the proportion of data points that are incorrectly classified when every point is labeled with the majority class.
For the example above, the misclassification rate of the dataset follows directly from this formula.
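A minimal sketch of these definitions in Python; the label lists and the two branches below are toy values made up for illustration, and the function names are mine, not from the notes:

```python
from collections import Counter

def misclassification_error(labels):
    # Error(D) = 1 - max_i p(i | D): the fraction of points outside the majority class
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)

def split_error(branches):
    # Weighted error of a stump that splits D into branches D_j:
    # sum_j |D_j| / |D| * Error(D_j)
    n = sum(len(b) for b in branches)
    return sum(len(b) / n * misclassification_error(b) for b in branches)

left = ["yes", "yes", "yes", "no"]              # Error = 1/4
right = ["no", "no", "no", "no", "yes", "yes"]  # Error = 2/6
print(misclassification_error(left + right))    # 0.5 before the split
print(split_error([left, right]))               # 0.3 after the split
```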
Problem: Choosing the splitting attribute that minimizes the misclassification rate may not be globally optimal (an issue with any greedy algorithm, a.k.a. locally optimal). It neglects the distribution of the class values of the misclassified instances; it only considers the majority class.
Solution: Use other metrics like Gini index, entropy, or information gain to find the best splitting attribute
Classification
Decision Tree Classification Clearly Explained! - YouTube
Entropy
$\text{Entropy}(D) = -\sum_i p(i \mid D)\,\log_2 p(i \mid D)$
Other name: $\text{Info}(D)$, same meaning.
If there is only one class, $-1 \cdot \log_2(1) = 0$.
For split attribute $A$: $\text{Info}_A(D) = \sum_j \frac{|D_j|}{|D|}\,\text{Info}(D_j)$
Like Gini impurity, it quantifies the uncertainty or disorder in a dataset
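A short sketch of these two formulas; the toy labels and the helper names are mine, not from the notes:

```python
import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum_i p(i | D) * log2 p(i | D)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_after_split(branches):
    # Info_A(D) = sum_j |D_j| / |D| * Info(D_j): weighted entropy of the branches of A
    n = sum(len(b) for b in branches)
    return sum(len(b) / n * entropy(b) for b in branches)

print(entropy(["yes", "yes", "yes", "yes"]))                # 0.0: only one class, -1 * log2(1) = 0
print(entropy(["yes", "yes", "no", "no"]))                  # 1.0: maximum disorder for 2 classes
print(entropy_after_split([["yes", "yes"], ["no", "no"]]))  # 0.0: a perfectly separating split
```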
Gini
$\text{Gini}(D) = 1 - \sum_i p(i \mid D)^2$
For split attribute $A$: $\text{Gini}_A(D) = \sum_j \frac{|D_j|}{|D|}\,\text{Gini}(D_j)$
Why It Works
Compared to the misclassification error $\text{Error}(D)$, which only cares about the majority class, the Gini impurity is more sensitive to changes in the class distribution. It penalizes nodes whose data points are spread across more classes.
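A matching sketch for the Gini impurity (toy labels invented for illustration, helper names mine); the last line shows the sensitivity to spreading points over more classes mentioned above:

```python
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_i p(i | D)^2: the chance of misclassifying a randomly drawn point
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(branches):
    # Gini_A(D) = sum_j |D_j| / |D| * Gini(D_j): weighted Gini of the branches of A
    n = sum(len(b) for b in branches)
    return sum(len(b) / n * gini(b) for b in branches)

print(gini(["yes", "yes", "yes", "yes"]))  # 0.0: pure node
print(gini(["yes", "yes", "no", "no"]))    # 0.5: maximum impurity for 2 classes
print(gini(["a", "a", "b", "c"]))          # 0.625: same majority share (error rate is 0.5 for both), but more classes
```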
Comparison: Entropy vs. Gini
Aspect | Entropy | Gini Impurity |
---|---|---|
Definition | Measures the uncertainty or disorder in the class distribution. | Measures the probability of misclassification of a random sample. |
Formula | $\text{Entropy}(D) = -\sum_i p_i \log_2 p_i$ | $\text{Gini}(D) = 1 - \sum_i p_i^2$ |
Range | $0$ (perfect purity) to $\log_2 k$ (maximum disorder for $k$ classes). | $0$ (perfect purity) to $1 - \tfrac{1}{k}$ (maximum impurity for $k$ classes). |
Behavior | More sensitive to changes in class probabilities, especially near $0$ or $1$. | Less sensitive to changes in class probabilities. |
Computational Cost | Slightly higher due to the use of logarithms. | Lower, as it involves only squares of probabilities. |
Interpretation | Based on information theory (bits of information). | Based on probability of misclassification. |
Use in Decision Trees | Both are commonly used; entropy is more popular in algorithms like ID3 and C4.5, while Gini is used in CART. | Simpler and faster to compute, often preferred in practice. |
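As a rough worked illustration of the Behavior row (the class proportions are just numbers chosen for the example): for a two-class node with proportions $(0.5, 0.5)$, entropy is $1.0$ bit and Gini is $0.5$, both at their maximum. For proportions $(0.99, 0.01)$, entropy is $-0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.081$ bits (about $8\%$ of its maximum), while Gini is $1 - 0.99^2 - 0.01^2 = 0.0198$ (about $4\%$ of its maximum), so near purity entropy retains relatively more signal about a small minority class.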
Information Gain (IG)
Other name: $\text{Gain}(A)$, same meaning.
$\text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D)$
$\frac{|D_j|}{|D|}$ is the weight of branch $j$ inside $\text{Info}_A(D)$.
Split Info
Split Info is a measure used to account for the number of splits and the distribution of data points across those splits. It quantifies the potential information generated by splitting the dataset into subsets based on an attribute $A$:
$\text{SplitInfo}_A(D) = -\sum_j \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$
This can also be written as $\text{SplitInfo}_A(D) = -\sum_j p_j \log_2 p_j$, where $p_j = \frac{|D_j|}{|D|}$ is the proportion of data points in subset $D_j$ relative to the entire dataset $D$.
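A small sketch of Split Info as a function of the branch sizes alone (the branch sizes below are made-up examples):

```python
import math

def split_info(branch_sizes):
    # SplitInfo_A(D) = -sum_j p_j * log2(p_j), with p_j = |D_j| / |D|
    n = sum(branch_sizes)
    return sum(-(s / n) * math.log2(s / n) for s in branch_sizes)

print(split_info([5, 5]))    # 1.0   -- balanced 2-way split
print(split_info([9, 1]))    # ~0.47 -- uneven 2-way split generates less split info
print(split_info([1] * 10))  # ~3.32 -- ten singleton branches are "expensive"
```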
Why is Split Info Needed?
Split Info is used to normalize the Information Gain when evaluating splits in decision trees. Here's why it's important:
- Problem with Information Gain:
    - Information Gain (IG) measures the reduction in entropy (or uncertainty) after splitting the dataset based on an attribute $A$.
    - However, IG tends to favor attributes with many unique values (e.g., an attribute like "ID" that has a unique value for each data point). Such attributes can create many small subsets $D_j$, leading to artificially high IG even if the split is not meaningful.
- Bias Toward High-Cardinality Attributes:
    - Without normalization, IG might prefer attributes that split the data into many small, pure subsets, even if those subsets are not useful for generalization.
    - This can lead to overfitting, where the decision tree becomes too complex and performs poorly on unseen data.
- Role of Split Info:
    - Split Info penalizes attributes that create many subsets or unevenly distributed subsets.
    - It quantifies the potential information generated by the split itself, independent of the class labels.
    - By dividing the Information Gain by Split Info, we get the Information Gain Ratio, which balances the benefit of the split (IG) against the complexity of the split (Split Info).
Information Gain Ratio
$\text{GainRatio}(A) = \dfrac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}$
By dividing IG by Split Info, the Gain Ratio ensures that the decision tree algorithm does not favor attributes with many unique values or uneven splits, making the resulting trees more balanced, less prone to overfitting, and more generalizable.
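A sketch pulling these pieces together on a made-up set of ten labels: an "ID"-like attribute that puts every point in its own branch gets the highest possible Information Gain, while the Gain Ratio prefers an ordinary binary attribute (function and variable names are mine, not from any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_gain_ratio(labels, branches):
    # Gain(A) = Info(D) - Info_A(D); GainRatio(A) = Gain(A) / SplitInfo_A(D)
    n = len(labels)
    info_a = sum(len(b) / n * entropy(b) for b in branches)
    split_info = sum(-(len(b) / n) * math.log2(len(b) / n) for b in branches)
    gain = entropy(labels) - info_a
    return gain, gain / split_info

labels = ["yes"] * 5 + ["no"] * 5

id_split = [[c] for c in labels]                   # one pure singleton branch per point
binary_split = [["yes"] * 5 + ["no"], ["no"] * 4]  # imperfect but sensible 2-way split

print(gain_and_gain_ratio(labels, id_split))      # gain = 1.00, ratio ~ 0.30
print(gain_and_gain_ratio(labels, binary_split))  # gain ~ 0.61, ratio ~ 0.63
```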
Regression
Decision Tree Regression Clearly Explained! - YouTube