ID3 and C4.5: how does “gain ratio” normalize “gain”?
Date: November 07, 2012 02:07AM
The ID3 algorithm uses "Information Gain" measure.
The C4.5 uses "Gain Ratio" measure which is Information Gain divided by SplitInfo, whereas SplitInfo is high for a split where records split evenly between different outcomes and low otherwise.
My question is:
How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.
It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.
On the other hand, it may be that there is a low number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.
What am I missing?