why are there two different trees for the same data set



Posted by: user

Date: February 21, 2022 05:25AM

I am working on the Pima Indians Diabetes Database in Weka. I noticed that the J48 decision tree is smaller than the Random Tree. I am unable to understand why the trees differ when the data set is the same.


Re: why are there two different trees for the same data set

Posted by: webmasterphilfv

Date: February 21, 2022 04:23PM

Hi,

There are different algorithms for building decision trees. A typical algorithm builds the tree from the top down, node by node. To decide which attribute to use at a node, the algorithm uses some criterion to compare the attributes.
There exist various criteria, such as the Gini index and information gain.

For example, the ID3 algorithm uses information gain (which is based on entropy), C4.5 uses the gain ratio, and CART uses the Gini index. If two algorithms don't use the same criterion to select attributes and build trees, the results can differ.
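
To make the idea of a selection criterion concrete, here is a minimal Python sketch of two common measures, entropy-based information gain and the Gini index, applied to a toy split. The function names and the toy labels are my own illustration and are not taken from Weka or from any particular paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Average impurity of the child groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups if g)


def information_gain(parent, groups):
    """Reduction in entropy obtained by splitting parent into groups."""
    return entropy(parent) - weighted_impurity(groups, entropy)


# Toy example: 5 "yes" and 5 "no" records, split by two hypothetical attributes.
parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]          # fairly clean split
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # weak split

for name, split in (("A", split_a), ("B", split_b)):
    print(name,
          "info gain:", round(information_gain(parent, split), 3),
          "weighted Gini:", round(weighted_impurity(split, gini), 3))
```

An algorithm picks the attribute with the highest gain (or the lowest weighted impurity). On real data the two measures do not always rank the attributes in the same order, which is one reason two algorithms can build different trees.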

Besides, some algorithms use additional techniques, such as pruning the tree after it is built.

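I have not read the papers recently, but in Weka, J48 is an implementation of C4.5 and prunes the tree by default, while RandomTree considers only a randomly chosen subset of attributes at each node and performs no pruning. So it is quite normal that they don't produce the same output, and that the RandomTree is larger.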

Best regards,


Re: why are there two different trees for the same data set

Posted by: user

Date: February 21, 2022 10:59PM

Thank you for the detailed response. I will look into it further.

For a data set with two classes, verified and not_verified, I have the following 'detailed accuracy by class' output:

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.746    0.440    0.760      0.746   0.753      0.303  0.653     0.732     verified
0.560    0.254    0.542      0.560   0.550      0.303  0.653     0.457     not_verified

Why does the verified class have a higher F-Measure than the not_verified class?


Re: why are there two different trees for the same data set

Posted by: webmasterphilfv

Date: February 25, 2022 01:20AM

Just by looking at these numbers, I don't know. I would recommend looking up the definitions of these measures.
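
For reference, the F-Measure is defined as the harmonic mean of precision and recall, F = 2PR / (P + R). A minimal Python sketch that recomputes it from the precision and recall values in the table above (the function name `f_measure` is hypothetical, not a Weka API):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the table above.
rows = {
    "verified": (0.760, 0.746),
    "not_verified": (0.542, 0.560),
}

for name, (p, r) in rows.items():
    print(f"{name}: F-Measure = {f_measure(p, r):.3f}")
```

Since both precision and recall are lower for not_verified than for verified, its F-Measure is lower as well. The recomputed values may differ from Weka's in the last digit, because the inputs used here are already rounded to three decimals.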

Maybe someone else can answer
