Mart

Reputation: 41

How to replace missing categorical data from large dataset in R?

I am using the dataset from https://www.kaggle.com/datasets/shilongzhuang/telecom-customer-churn-by-maven-analytics
It contains many categorical variables with missing data points, and I am not sure how to deal with them. Since almost every row has at least one missing value, I can't just delete the affected rows. Imputing with the mean or mode is also not suitable for this dataset.

What is the best way to handle these missing values?

For example I tried to impute the variable Multiple.Lines like this:

telecom_customer_churn$Multiple.Lines = impute(telecom_customer_churn$Multiple.Lines, "random")

This works, but when I try to make a bar plot like this:

ggplot(data = telecom_customer_churn) +
  geom_histogram(mapping = aes(x = Multiple.Lines),  color = "blue", fill = "lightblue")

It shows me the error:

Error: Discrete value supplied to continuous scale

This is weird to me, because all the missing values of Multiple.Lines have been replaced by either 'yes' or 'no'.

Upvotes: 0

Views: 118

Answers (1)

mri

Reputation: 590

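Here's a solution to your problem: impute() changes the class of the object to "impute". Applying as.factor() forces the data to be read as categorical once again. Note also that the plot below uses geom_bar() rather than geom_histogram(), since a histogram expects a continuous variable.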

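library(Hmisc)    # provides impute()
library(ggplot2)  # provides ggplot(), aes() and geom_bar()

# Small example data with missing values
telecom_customer_churn = data.frame(Multiple.Lines = c(rep("yes", 5), NA, rep("no", 7), rep(NA, 2), rep("maybe", 3)))
# Replace each NA with a randomly drawn observed value
telecom_customer_churn$Multiple.Lines = impute(telecom_customer_churn$Multiple.Lines, "random")

# Convert back to a factor so ggplot uses a discrete x axis
ggplot(telecom_customer_churn, aes(x = as.factor(Multiple.Lines))) +
  geom_bar(color = "blue", fill = "lightblue")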
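
If you would rather avoid the "impute" class altogether, the same kind of random imputation can be sketched in base R. This is only a minimal sketch: impute_random() is a hypothetical helper written for this example (it is not part of Hmisc or any other package), the toy data mirrors the example above rather than the real Kaggle file, and set.seed() is there only to make the draw reproducible.

```r
# Toy data mirroring the example above (not the real Kaggle file)
set.seed(42)  # only so that the random draw is reproducible
telecom_customer_churn <- data.frame(
  Multiple.Lines = c(rep("yes", 5), NA, rep("no", 7), rep(NA, 2), rep("maybe", 3))
)

# Hypothetical helper: fill each NA with a value sampled (with replacement)
# from the observed entries, then return the result as a plain factor
impute_random <- function(x) {
  missing <- is.na(x)
  observed <- x[!missing]
  x[missing] <- sample(observed, sum(missing), replace = TRUE)
  factor(x)
}

telecom_customer_churn$Multiple.Lines <-
  impute_random(telecom_customer_churn$Multiple.Lines)

class(telecom_customer_churn$Multiple.Lines)  # check that it is a factor
anyNA(telecom_customer_churn$Multiple.Lines)  # check that no NA is left
```

Because the column is already a factor here, it should be possible to pass it to aes(x = Multiple.Lines) with geom_bar() directly, without the as.factor() wrapper. Keep in mind that random imputation preserves the observed category proportions but ignores any relationship with the other columns.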

Upvotes: 1
