JasonEdinburgh

Reputation: 689

Weka Resample to balance instances in binary dataset

I've only been using Weka for a couple of weeks but I am absolutely blown away by how great it is!

But I have a question. I have a dataset with a target column that is either True or False.

6709 instances in my dataset are True

25318 instances are False.

I want to randomly add duplicates of my True instances to produce a new dataset with 25318 True and 25318 False.

The only filter I can find that does this is the supervised Resample filter; however, I am having trouble understanding which parameters I should use.

(there might be a better filter to do what I want)

I've had some success with these parameters:

biasToUniformClass = 1.0
invertSelection = False
noReplacement = False
randomSeed = 1
sampleSizePercent = 157.5 (a magic number I've arrived at by trial and error)

This produces 25277 True and 25165 False. Not exactly what I want, but quite close.

The problem is that I can't figure out how to arrive at the magic number. I'm also not getting exactly the number of instances that I want in each class.

Is there a better filter for this purpose? If not, is there a way to calculate the sampleSizePercent magic number?

Any help is greatly appreciated :)

Supplemental question: should I run NominalToBinary on my boolean columns to ensure they are binary? I'm using a NaiveBayes classifier (at the moment) and I don't have any missing instances.

Jason

Upvotes: 0

Views: 2719

Answers (1)

Matthew Spencer

Reputation: 2295

I think the tricky part of this question is getting a perfect balance using the Resample filter. As its description states, it 'Produces a random sub-sample of a dataset using either sampling with replacement or without replacement'. Because the instances are drawn randomly, there is no guarantee that you will get exactly the same number of instances in each class.
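To illustrate why random draws rarely land on an exact 50/50 split, here is a toy simulation in Python. It is only a sketch: it assumes each drawn instance is assigned to the True class with probability 0.5, which is my simplification and not taken from Weka's source, and the function name `simulate_uniform_class_draws` is hypothetical.

```python
import random


def simulate_uniform_class_draws(total_draws, seed):
    """Toy model: each draw lands in the True class with probability 0.5.

    This is an assumption made for illustration, not Weka's actual code.
    Returns (true_count, false_count).
    """
    rng = random.Random(seed)
    true_count = sum(1 for _ in range(total_draws) if rng.random() < 0.5)
    return true_count, total_draws - true_count


# Different seeds give slightly different class counts for the same total.
for seed in (1, 2, 3):
    print(seed, simulate_uniform_class_draws(50636, seed))
```

The two counts always add up to the requested total, but they only match each other by chance.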

As for the magic number, it is determined by the total number of instances you want after the filter is applied. In your case, that is 2 × 25318 = 50636 instead of the original 32027, so the ratio is 50636 / 32027 = 1.581. Since sampleSizePercent is a percentage, the value to enter is about 158.1, which is close to the 157.5 you found by trial and error. However, as stated above, you may not get an exact match of true and false cases.
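The arithmetic above can be written as a small Python helper, in case you want to recompute the value for other class counts. This is just a sketch of the calculation; the function name `sample_size_percent` is made up for illustration and is not part of Weka.

```python
def sample_size_percent(minority, majority):
    """Percentage of the original dataset size that yields 2 * majority instances."""
    target_total = 2 * majority            # equal counts in both classes
    original_total = minority + majority   # size of the unbalanced dataset
    return 100.0 * target_total / original_total


percent = sample_size_percent(6709, 25318)
print(round(percent, 1))  # 158.1
```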

If you really need an exact figure, you could preprocess the data in your favourite spreadsheet. One possible method is to randomise the true cases (using a separate column of random numbers), sort them, and copy them until their number matches that of the false cases. It's not an automated solution, and it sits outside of Weka, but I have used this method before and it does the job reasonably quickly.
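If you would rather script this than use a spreadsheet, the same idea can be sketched in Python. Note that this is a variation on the method above, not the same procedure: it picks the extra duplicates at random with replacement instead of sorting and copying. The function name `oversample_to_balance` and the `(id, label)` row layout are assumptions for the example; you would still need to load and save your own file.

```python
import random


def oversample_to_balance(rows, label_of, seed=1):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)                                   # keep every original row
        balanced.extend(rng.choices(group, k=target - len(group)))  # random duplicates
    rng.shuffle(balanced)                                        # avoid class-ordered output
    return balanced


# Stand-in data with the class counts from the question.
rows = [(i, True) for i in range(6709)] + [(i, False) for i in range(25318)]
balanced = oversample_to_balance(rows, label_of=lambda r: r[1])
counts = {label: sum(1 for r in balanced if r[1] == label) for label in (True, False)}
print(counts)  # {True: 25318, False: 25318}
```

Every original row is kept, so only the True class gains duplicates and the counts match exactly.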

Hope this Helps!

Upvotes: 2
