Doubt in input format
Posted by:
Vin
Date: January 03, 2015 12:33AM
I have a small doubt in data mining.
Suppose I have a dataset as follows:
PrdId P1 P2 Target
1 1.2 456 H
1 1.23 400 L
1 1.0 412 L
1 0.8 424 N
1 0.6 400 L
2 4.6 520 N
2 4.7 550 H
2 4.68 550 H
2 4.0 530 N
2 4.0 500 L
3 3.3 345 H
3 3.3 340 L
3 3.6 345 H
3 3.6 340 L
Now, I want to feed this to a mining algorithm (random forest), but for the current example let's take a decision tree.
If I feed this to a decision tree, the rules that are formed will be of the form
if PrdId = 1 and P2 < 410 then Target = L
This means the rules will be PrdId-specific, so if a new data series is given as input, the model will not understand it, because the new ID was not known to it during training.
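To make the concern concrete, here is a minimal Python sketch. The function names and the row with PrdId 4 are made up for illustration; only the rule itself comes from the example above. It shows that a rule carrying a PrdId condition does not fire for an ID that was never seen, while the same threshold without the PrdId condition does.

```python
# Hypothetical encoding of the example rule
# "if PrdId = 1 and P2 < 410 then Target = L".
def rule_prdid_specific(row):
    """Return a predicted Target, or None when the rule does not fire."""
    if row["PrdId"] == 1 and row["P2"] < 410:
        return "L"
    return None


# The same P2 threshold with the PrdId condition removed (illustration only).
def rule_generic(row):
    """Return a predicted Target, or None when the rule does not fire."""
    if row["P2"] < 410:
        return "L"
    return None


seen = {"PrdId": 1, "P1": 1.23, "P2": 400}   # a row from the dataset above
unseen = {"PrdId": 4, "P1": 1.1, "P2": 400}  # made-up row with a new PrdId

print(rule_prdid_specific(seen))    # L
print(rule_prdid_specific(unseen))  # None
print(rule_generic(unseen))         # L
```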
Problem 1: Please tell me how I can specify a group ID in models so that the rules are generated for groups. I can't find any way to do this, except in association rule mining, which cannot be applied to this kind of dataset.
So, I am calculating the coefficient of variation of each of the two columns (P1 and P2), so that I get a single reading per PrdId, and taking the most frequent value of Target.
PrdId CV1 CV2 Target
1 0.3 0.4 L
2 0.4 0.3 H
I then give this as input to the model and generate rules; now I can ignore PrdId in the input.
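The aggregation I have in mind can be sketched as below, in Python with only the standard library. This is a rough sketch under some assumptions: the CV is taken per column as sample standard deviation divided by mean, ties in Target frequency are broken alphabetically (an arbitrary choice), and the helper names are made up. The CV values in the small table above are only illustrative, so the numbers this prints will differ from them.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# The dataset from the post: (PrdId, P1, P2, Target).
rows = [
    (1, 1.2, 456, "H"), (1, 1.23, 400, "L"), (1, 1.0, 412, "L"),
    (1, 0.8, 424, "N"), (1, 0.6, 400, "L"),
    (2, 4.6, 520, "N"), (2, 4.7, 550, "H"), (2, 4.68, 550, "H"),
    (2, 4.0, 530, "N"), (2, 4.0, 500, "L"),
    (3, 3.3, 345, "H"), (3, 3.3, 340, "L"),
    (3, 3.6, 345, "H"), (3, 3.6, 340, "L"),
]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return stdev(values) / mean(values)


def most_frequent(labels):
    """Most common label; ties are broken alphabetically (arbitrary choice)."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Collect the rows of each PrdId.
groups = defaultdict(list)
for prd_id, p1, p2, target in rows:
    groups[prd_id].append((p1, p2, target))

# One summary record per PrdId: (CV1, CV2, Target).
summary = {}
for prd_id, members in groups.items():
    summary[prd_id] = (
        round(coefficient_of_variation([m[0] for m in members]), 3),
        round(coefficient_of_variation([m[1] for m in members]), 3),
        most_frequent([m[2] for m in members]),
    )

for prd_id, (cv1, cv2, target) in sorted(summary.items()):
    print(prd_id, cv1, cv2, target)
```

The records in `summary` no longer depend on PrdId as a feature, so they could be passed to a decision tree or random forest with PrdId dropped. Note that PrdId 2 and PrdId 3 have tied Target frequencies in this data, so the chosen label depends on the tie-breaking rule.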
Problem 2: Is this approach valid, i.e. calculating statistical values for grouped data and using those values as input to the mining algorithm?