Reputation: 185
I'm trying to do some basic multi-class classification in Azure ML (each row has exactly one of five labels). I have some basic data in the following format:
value_x value_y label
x1 y1 label1
x2 y2 label1
x3 y3 label2
.....
My problem is that certain labels (out of a total of five) are overrepresented in my data: about 40% of the rows are label1, about 20% are label2, and each of the remaining labels is around 10%.
I would like to draw a sample from this data to train my model, so that each label is represented in equal amounts.
I tried the stratification option in the Sampling module on the label column, but that just gives me a sample with the same distribution of labels as in the initial dataset.
Any idea how I could do this with a module?
Upvotes: 0
Views: 601
Reputation: 5225
I was able to do this using a combination of the Split Data, Partition and Sample, and Add Rows modules. There may be an easier way to do it, but I did confirm it works. :) I published my work at http://gallery.azureml.net/Details/1245147fd7004e91bc7a3683cda19cc7 so you can grab it directly from there and run it to confirm it does what you expect.
Since you said you wanted a sample of the data, I reduced each label to about 10% of the original dataset so that all labels are represented equally. Since you have a good understanding of the distribution in your dataset, leave labels 3, 4, and 5 as they are at about 10% each, and sample label 1 down to 1/4 of its rows and label 2 down to 1/2 of its rows, so that each of them also ends up at about 10%.
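If it helps to see the idea outside the Studio modules, here is a minimal sketch in plain Python of the same equal-count downsampling: split the rows by label, sample each group down to the size of the smallest group, and put the rows back together. The function name `downsample_to_smallest`, the row layout, and the row counts are all made up for illustration; this is not what the published experiment runs internally.

```python
import random
from collections import Counter, defaultdict

def downsample_to_smallest(rows, label_index=-1, seed=0):
    """Return a subset of rows in which every label occurs equally often."""
    rng = random.Random(seed)

    # Step 1: split the rows into one group per label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_index]].append(row)

    # Step 2: the smallest group sets the per-label target size.
    target = min(len(group) for group in by_label.values())

    # Step 3: sample each group down to the target and recombine.
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced

# Hypothetical row counts: label1 and label2 are overrepresented.
counts = {"label1": 400, "label2": 200, "label3": 100, "label4": 100, "label5": 100}
rows = [(float(i), float(i) * 2, label)
        for label, n in counts.items()
        for i in range(n)]

balanced = downsample_to_smallest(rows)
label_counts = Counter(row[-1] for row in balanced)
print(label_counts)
```

With these made-up counts, label1 is sampled at a rate of 1/4 and label2 at 1/2, while the three smaller labels are kept whole.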
To explain what I did in the workspace linked above:
Finally, I didn't include this in my work, but you can also look at the SMOTE module. It will increase the number of low-occurring samples using synthetic minority oversampling.
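To give a rough feel for what synthetic minority oversampling does, here is a small plain-Python sketch of the core idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The function name `smote_sketch`, its parameters, and the sample points are hypothetical, and this is only an illustration of the general technique, not the implementation used by the SMOTE module.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical (value_x, value_y) pairs for one underrepresented label.
minority = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4), (1.2, 0.9)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two existing minority samples, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.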
Upvotes: 3