The Data Mining Forum                             open-source data mining software data mining conferences Data Science for Social and Behavioral Analytics DSSBA 2022 data science journal
IMPORTANT: This is the old Data Mining forum.
I keep it online so that you can read the old messages.

Please post your new messages in the new forum:
ID3 and C4.5: how does “gain ratio” normalize “gain”?
Posted by: Yoni
Date: November 07, 2012 02:07AM

The ID3 algorithm uses "Information Gain" measure.

The C4.5 uses "Gain Ratio" measure which is Information Gain divided by SplitInfo, whereas SplitInfo is high for a split where records split evenly between different outcomes and low otherwise.

My question is:

How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.

It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.

On the other hand, it may be that there is a low number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.

What am I missing?

Options: ReplyQuote

This forum is powered by Phorum and provided by P. Fournier-Viger (© 2012).
Terms of use.