Is duplicating data a valid way to fix bias?

Question

I’m reading a paper in the area of engineering. They have a labelled dataset which is biased. There are many more instances labelled A than B. They want to train a classifier to predict the A or B label based on some inputs (states).

The authors say:

To artificially remedy this problem, random replicas of the B states are incorporated into the dataset to even out the lot.

I don’t know much about data analytics, but this doesn’t sound valid to me. Is it?

venkatadileep · Accepted Answer

This kind of dataset is usually called imbalanced data. What the authors describe is a valid approach: to deal with imbalance, you can duplicate samples of the under-represented class until the classes are balanced. Rather than duplicating at random, though, it is often better to look at the patterns in the data and add samples accordingly. There are many algorithms and methods for imbalanced classification; this question may help: https://datascience.stackexchange.com/questions/24392/why-we-need-to-handle-data-imbalance
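For illustration, here is a minimal sketch of the random duplication described in the quoted passage (often called random oversampling). It is not the paper's code: the function name `random_oversample`, the toy arrays `X` and `y`, and the `seed` parameter are all made up for this example, and it assumes NumPy is available.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest one (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    # Start with every original row, then append the sampled duplicates.
    keep = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # Sample with replacement, since we need more rows than exist.
            keep.append(rng.choice(idx, size=target - count, replace=True))

    order = np.concatenate(keep)
    return X[order], y[order]

# Toy data (invented): 8 states labelled "A", 2 labelled "B".
X = np.arange(20).reshape(10, 2)
y = np.array(["A"] * 8 + ["B"] * 2)

X_bal, y_bal = random_oversample(X, y)
```

If you try something like this, apply it only to the training split, after separating out the test data. Otherwise copies of the same B state can end up in both sets and make the classifier look better than it is.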
