Reputation: 461
I’m reading a paper in the area of engineering. They have a labelled dataset which is biased. There are many more instances labelled A than B. They want to train a classifier to predict the A or B label based on some inputs (states).
The authors say:
To artificially remedy this problem, random replicas of the B states are incorporated into the dataset to even out the lot.
I don’t know much about data analytics, but this doesn’t sound valid to me. Is it?
Upvotes: 2
Views: 42
Reputation: 183
This kind of data is usually called imbalanced data. What the authors did is a valid way to deal with it: duplicating instances of the under-represented class to balance the dataset is commonly known as oversampling. Rather than adding replicas purely at random, though, it is often better to look at the patterns in the data and add instances based on them. There are many algorithms and methods for dealing with imbalanced classification; this question might help: https://datascience.stackexchange.com/questions/24392/why-we-need-to-handle-data-imbalance
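To make the idea concrete, here is a minimal sketch of random oversampling using only the Python standard library. It is not the paper's code: the function name `random_oversample`, the toy `states` and `labels`, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(states, labels, minority_label, seed=0):
    """Append random replicas of minority-class rows until both classes match in size.

    Hypothetical helper for illustration; assumes exactly two classes.
    """
    rng = random.Random(seed)
    # Indices of the under-represented class (B in the question).
    minority = [i for i, lab in enumerate(labels) if lab == minority_label]
    # How many replicas are needed to match the majority class.
    n_extra = (len(labels) - len(minority)) - len(minority)
    if not minority or n_extra <= 0:
        return list(states), list(labels)
    # Sample with replacement from the existing minority rows.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    new_states = list(states) + [states[i] for i in extra]
    new_labels = list(labels) + [labels[i] for i in extra]
    return new_states, new_labels


# Toy data (made up): 8 instances labelled A, 2 labelled B.
states = [[0.1 * i, 1.0 - 0.1 * i] for i in range(10)]
labels = ["A"] * 8 + ["B"] * 2

bal_states, bal_labels = random_oversample(states, labels, "B")
print(Counter(bal_labels))
```

After the call, both classes have 8 instances, and every added row is an exact copy of one of the two original B states. No new information is created; the classifier simply sees the B examples more often during training.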
Upvotes: 1