Aiman

Reputation: 21

Discretization in Weka

I need to know when the right time is to do discretization in Weka. I have a data set, and I need to create training and testing samples from it. Should I discretize the numerical attributes before the sampling or after it?

Upvotes: 0

Views: 3101

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

This should be obvious.

As long as the discretization gives the same result regardless of how the data is split, you can do it afterwards. But what is the benefit of that? Just do the preprocessing first, then.

If you discretize by rounding, e.g. float to integer, then you should be fine, because rounding treats each value on its own and is unaffected by the split. But if you discretize by, e.g., quantiles, it should be obvious that you can screw up badly: the cut points are computed from the data, so you will discretize the different parts differently!
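To make the difference concrete, here is a minimal Python sketch (not Weka code). The values in `part_a` and `part_b` are made up for illustration, and the helper `two_bin_quantile` is a hypothetical name; the "quantile" is simply the median, giving two bins.

```python
from statistics import median

# Hypothetical values, chosen only for illustration.
part_a = [1.2, 2.7, 3.1, 4.4, 5.8]
part_b = [4.4, 5.8, 6.3, 7.9, 8.5]

def two_bin_quantile(values, cut):
    """Assign bin 0 to values at or below the cut point, bin 1 above it."""
    return [0 if v <= cut else 1 for v in values]

# Quantile discretization: the cut point (here the median) depends on the data,
# so each part gets its own cut point when processed separately.
cut_a = median(part_a)
cut_b = median(part_b)
bins_a = two_bin_quantile(part_a, cut_a)
bins_b = two_bin_quantile(part_b, cut_b)

# The value 4.4 occurs in both parts but ends up in a different bin in each.
bin_in_a = bins_a[part_a.index(4.4)]
bin_in_b = bins_b[part_b.index(4.4)]
print("cut points:", cut_a, cut_b)
print("bin of 4.4:", bin_in_a, "vs", bin_in_b)

# Rounding: each value is mapped on its own, so the split does not matter.
rounded_separately = [round(v) for v in part_a] + [round(v) for v in part_b]
rounded_together = [round(v) for v in part_a + part_b]
print("rounding agrees:", rounded_separately == rounded_together)
```

The same input value gets two different discrete codes under the per-part quantile scheme, while rounding gives identical results either way.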

Let's say you discretize each part of the data separately, into two different values each:

Input data    Type     Output value
0.9           good     1.05
1.0           good     1.05
1.1           good     1.05
1.2           good     1.05
---
2.1           good     2.20
2.3           good     2.20
2.2           good     2.20
---  SPLIT HERE ---
1.1           bad      1.20
1.2           bad      1.20
1.3           bad      1.20
---
1.9           bad      2.00
2.0           bad      2.00
2.1           bad      2.00

See, both "good" and "bad" were discretized into two discrete values, by using the average of each cluster of values. But as the averages for "good" and "bad" differ, the resulting attribute clearly exposes the true membership. The task of detecting "bad" has become substantially easier.
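The table can be reproduced with a short Python sketch (again, not Weka code). The input values are the ones from the table; `THRESHOLD = 1.6` and the helper names are assumptions introduced here just to separate the low cluster from the high one.

```python
from statistics import mean

# Input values from the table, grouped by their true type.
good = [0.9, 1.0, 1.1, 1.2, 2.1, 2.3, 2.2]
bad = [1.1, 1.2, 1.3, 1.9, 2.0, 2.1]

THRESHOLD = 1.6  # hypothetical boundary between the low and the high cluster

def cluster_means(values):
    """Return the (low, high) cluster averages, rounded to two decimals."""
    low = [v for v in values if v < THRESHOLD]
    high = [v for v in values if v >= THRESHOLD]
    return round(mean(low), 2), round(mean(high), 2)

def discretize(values, means):
    """Replace each value by the average of the cluster it falls into."""
    low_mean, high_mean = means
    return [low_mean if v < THRESHOLD else high_mean for v in values]

# Separate preprocessing: each part computes its own averages.
good_sep = discretize(good, cluster_means(good))
bad_sep = discretize(bad, cluster_means(bad))

# Joint preprocessing: averages are computed once, over all the data.
joint_means = cluster_means(good + bad)
good_joint = discretize(good, joint_means)
bad_joint = discretize(bad, joint_means)

print("separate:", sorted(set(good_sep)), sorted(set(bad_sep)))
print("joint:   ", sorted(set(good_joint)), sorted(set(bad_joint)))
```

With separate preprocessing the two types share no output value at all, so the discretized attribute alone tells them apart. With joint preprocessing both types map to the same two values, and that leak disappears.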

Do not perform separate preprocessing, ever.

Upvotes: 2
