Reputation: 21
I need to know when the right time is to do discretization in Weka. I have a data set, and I need to create training and testing samples from it. Should I discretize the numerical attributes before the sampling or after it?
Upvotes: 0
Views: 3101
Reputation: 77454
This should be obvious.
If the discretization gives the same result regardless of how the data is split, you can do it afterwards. But what is the benefit of that? Just do the preprocessing first.
If you discretize by rounding, e.g. float to integer, you should be fine, because rounding is unaffected by the split. But if you discretize by quantiles, for example, you can screw up badly, because the different parts will be discretized differently!
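To make the quantile case concrete, here is a minimal sketch in plain Python (not Weka). It assumes the simplest quantile discretization, a single cut at the median, and the helper names `fit_median_cut` and `apply_cut` are made up for illustration. It shows that the same raw value can land in different bins depending on which part the cut point was learned from.

```python
from statistics import median

def fit_median_cut(values):
    """Learn a single cut point (the median) from the given values."""
    return median(values)

def apply_cut(values, cut):
    """Map each value to bin 0 (at or below the cut) or bin 1 (above it)."""
    return [0 if v <= cut else 1 for v in values]

# Two parts of the same data set, e.g. after a split (illustrative numbers).
part_a = [0.9, 1.0, 1.1, 1.2, 2.1, 2.2, 2.3]
part_b = [1.1, 1.2, 1.3, 1.9, 2.0, 2.1]

# Cut points learned separately on each part differ.
cut_a = fit_median_cut(part_a)
cut_b = fit_median_cut(part_b)

# The same raw value ends up in different bins under the two cut points.
bin_under_a = apply_cut([1.3], cut_a)[0]
bin_under_b = apply_cut([1.3], cut_b)[0]
print(cut_a, cut_b, bin_under_a, bin_under_b)

# Rounding needs no learned cut point, so it is the same on any part.
print(round(1.3))
```

Learning the cut point once and applying that same cut to every part avoids the mismatch.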
Let's say you discretize each part of the data separately into two discrete values:
Input data   Type   Output value
0.9          good   1.05
1.0          good   1.05
1.1          good   1.05
1.2          good   1.05
---
2.1          good   2.20
2.3          good   2.20
2.2          good   2.20
--- SPLIT HERE ---
1.1          bad    1.20
1.2          bad    1.20
1.3          bad    1.20
---
1.9          bad    2.00
2.0          bad    2.00
2.1          bad    2.00
See, both "good" and "bad" were discretized into two discrete values, by using the average of each cluster of values. But as the averages for "good" and "bad" differ, the resulting attribute clearly exposes the true membership. The task of detecting "bad" has become substantially easier.
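The table can be reproduced with a short sketch in plain Python (not Weka). It assumes the two clusters are formed by a fixed threshold of 1.5, which is my assumption since the example does not say how the clusters are found, and the function name `cluster_mean_discretize` is hypothetical. It compares discretizing each part separately with discretizing everything together.

```python
from statistics import mean

def cluster_mean_discretize(values, threshold=1.5):
    """Replace each value by the mean of its cluster (below / at-or-above threshold).

    The fixed threshold is an assumption made for this illustration.
    """
    low = [v for v in values if v < threshold]
    high = [v for v in values if v >= threshold]
    low_mean, high_mean = round(mean(low), 2), round(mean(high), 2)
    return [low_mean if v < threshold else high_mean for v in values]

good = [0.9, 1.0, 1.1, 1.2, 2.1, 2.3, 2.2]
bad = [1.1, 1.2, 1.3, 1.9, 2.0, 2.1]

# Separate preprocessing: each part gets its own pair of output values.
separate_good = set(cluster_mean_discretize(good))
separate_bad = set(cluster_mean_discretize(bad))

# Joint preprocessing: discretize everything together, then split.
joint = cluster_mean_discretize(good + bad)
joint_good = set(joint[:len(good)])
joint_bad = set(joint[len(good):])

# No shared output value means the attribute alone identifies the part.
print(separate_good, separate_bad, separate_good.isdisjoint(separate_bad))
# Identical output values means the attribute no longer leaks membership.
print(joint_good, joint_bad, joint_good == joint_bad)
```

With separate preprocessing the two parts share no output value, so the discretized attribute gives away which part a row came from. With joint preprocessing both parts map to the same two values.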
Do not perform separate preprocessing, ever.
Upvotes: 2