ID3 and C4.5: how does “gain ratio” normalize “gain”?

Posted by: Yoni

Date: November 07, 2012 02:07AM

The ID3 algorithm uses the "Information Gain" measure.

C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split whose records are distributed evenly across the outcomes and low otherwise.
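
For reference, the two measures are usually written as follows. This is only a sketch of the standard definitions; the symbols are notation assumed here: S is the set of records at the node, A is the attribute tested, S_1, ..., S_k are the subsets produced by the split, and H is the class entropy.

```latex
\mathrm{Gain}(S, A) = H(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, H(S_i)

\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}

\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```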

My question is:

How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.

It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.

On the other hand, it may be that there is a high number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.
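
To make the cases above concrete, here is a minimal sketch that computes SplitInfo directly from the number of records in each branch. The function name `split_info` and the branch sizes are assumptions chosen purely for illustration; they do not come from C4.5 or from any particular library.

```python
from math import log2


def split_info(sizes):
    """Return SplitInfo in bits for a split whose branches hold `sizes` records."""
    total = sum(sizes)
    # Empty branches contribute nothing (0 * log2(0) is taken as 0).
    return -sum((n / total) * log2(n / total) for n in sizes if n > 0)


# Hypothetical branch sizes, for illustration only.
even_two_way = [50, 50]            # 2 outcomes, records split evenly
skewed_two_way = [99, 1]           # 2 outcomes, very uneven
even_eight_way = [10] * 8          # 8 outcomes, records split evenly
skewed_eight_way = [93] + [1] * 7  # 8 outcomes, very uneven

print(split_info(even_two_way))      # 1.0
print(split_info(even_eight_way))    # 3.0
print(split_info(skewed_two_way))    # below 1.0
print(split_info(skewed_eight_way))  # below 3.0
```

With this definition, an even split over k branches gives log2(k) bits, so the largest value SplitInfo can take does grow with the number of outcomes, while a skewed split stays below that bound.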

What am I missing?
