Deepak Chaudhary

Reputation: 93

Handling a missing value in machine learning

I am analyzing a dataset with the following columns: [id, location, tweet, target_value]. Some rows are missing a value in the location column. My idea is to extract the location from the tweet column of the same row (if the tweet mentions a location) and put that value in the location column for that row.

Now I have some questions about this approach.

Is this a good way to do it? Can we fill in missing values using the training data itself? Won't this be considered a redundant feature (because we are deriving the values of this feature from another feature)?

Upvotes: 1

Views: 239

Answers (1)

hakansander

Reputation: 377

Can you please clarify your dataset a little bit more?

First, if we assume that location means the place the tweet was posted from, then your method (filling in the missing location values from the tweet text) is wrong, because a place mentioned in a tweet is not necessarily the place it was posted from.

Second, if we assume that the tweet text states the location correctly, then you can fill in the missing location values from the tweets.

If the second assumption holds, this is a good approach because you are adding correct information to your dataset. In other words, you are giving the model more detailed information, so it can predict more accurately at test time.

Regarding your question about whether the filled-in location would be a redundant feature (because its values are derived from another feature):
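As a rough illustration of the second case, here is a minimal sketch in plain Python that fills an empty location from the tweet text. The `KNOWN_LOCATIONS` list, the helper names, and the sample rows are all made up for the example; a real dataset would need a much larger place list or a named-entity recognizer.

```python
import re

# Hypothetical gazetteer (an assumption for this sketch, not part of the dataset).
KNOWN_LOCATIONS = ["New York", "London", "Mumbai", "Tokyo"]

def extract_location(tweet):
    """Return the first gazetteer entry that appears in the tweet, or None."""
    for place in KNOWN_LOCATIONS:
        pattern = r"\b" + re.escape(place) + r"\b"
        if re.search(pattern, tweet, flags=re.IGNORECASE):
            return place
    return None

def fill_missing_locations(rows):
    """Return copies of the rows with empty 'location' filled from 'tweet' where possible."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if not row.get("location"):
            row["location"] = extract_location(row["tweet"])
        filled.append(row)
    return filled

# Made-up sample rows using the column names from the question.
rows = [
    {"id": 1, "location": "London", "tweet": "Flooding near the river", "target_value": 1},
    {"id": 2, "location": None, "tweet": "Huge fire in mumbai tonight", "target_value": 1},
    {"id": 3, "location": None, "tweet": "I love this song", "target_value": 0},
]
filled_rows = fill_missing_locations(rows)
```

Rows whose tweet mentions no known place keep a missing location, so you would still need a fallback for those (for example, a separate "unknown" category).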

You can remove the location column and train your model on the remaining 3 columns. Then evaluate the new model with whatever metrics you use (accuracy etc.) and compare the results with those of the model trained on all 4 columns. If dropping the column makes no meaningful difference, or the model does better without it, you can treat the column as redundant. For numeric features, you can also use Principal Component Analysis (PCA) to detect correlated columns.
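The comparison described here can be sketched roughly as follows. This is only a toy under stated assumptions: the "model" is a simple majority-vote lookup table standing in for whatever classifier you actually use, the `keyword` feature is a hypothetical stand-in for features derived from the tweet text, and the rows are invented.

```python
from collections import Counter, defaultdict

def fit_lookup(rows, features):
    """Learn the majority target for each combination of feature values."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[f] for f in features)
        votes[key][row["target_value"]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    # Fallback for unseen combinations: the overall majority class.
    default = Counter(r["target_value"] for r in rows).most_common(1)[0][0]
    return table, default

def accuracy(model, rows, features):
    """Fraction of rows whose predicted target matches the true target."""
    table, default = model
    hits = sum(
        table.get(tuple(row[f] for f in features), default) == row["target_value"]
        for row in rows
    )
    return hits / len(rows)

def compare_with_and_without(train, test, features, candidate):
    """Score the same model with all features and with `candidate` removed."""
    reduced = [f for f in features if f != candidate]
    full_score = accuracy(fit_lookup(train, features), test, features)
    reduced_score = accuracy(fit_lookup(train, reduced), test, reduced)
    return full_score, reduced_score

# Invented toy data; 'keyword' is a hypothetical feature derived from the tweet.
train = [
    {"keyword": "fire", "location": "Mumbai", "target_value": 1},
    {"keyword": "fire", "location": "London", "target_value": 1},
    {"keyword": "flood", "location": "London", "target_value": 1},
    {"keyword": "song", "location": "Mumbai", "target_value": 0},
    {"keyword": "song", "location": "London", "target_value": 0},
]
test = [
    {"keyword": "fire", "location": "Tokyo", "target_value": 1},
    {"keyword": "song", "location": "London", "target_value": 0},
]
full_score, reduced_score = compare_with_and_without(
    train, test, ["keyword", "location"], "location"
)
```

If `reduced_score` is about the same as `full_score` on held-out data, that is a hint the column adds little. With a real dataset you would want cross-validation rather than a single small test set.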

Finally, please NEVER let training data end up in your test dataset. Your test scores will look better than they really are, any overfitting will stay hidden, and the model will most probably fail when you use it in a real-world environment.
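One simple way to guard against this is to split once and then check that no row appears on both sides. The sketch below assumes each row has a unique `id`, as in the question's columns; the function names and the 80/20 split are just illustrative choices.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def assert_no_overlap(train, test, key="id"):
    """Raise ValueError if any row id appears in both train and test."""
    shared = {r[key] for r in train} & {r[key] for r in test}
    if shared:
        raise ValueError(f"rows present in both train and test: {sorted(shared)}")

# Invented rows; only the 'id' column matters for this check.
all_rows = [{"id": i, "tweet": f"tweet {i}", "target_value": i % 2} for i in range(10)]
train_rows, test_rows = split_train_test(all_rows)
assert_no_overlap(train_rows, test_rows)
```

Any preprocessing that learns from the data (for example, building a place-name list from the tweets) should likewise be fitted on the training rows only and then applied to the test rows.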

Upvotes: 1
