Reputation: 41
I am using the dataset from https://www.kaggle.com/datasets/shilongzhuang/telecom-customer-churn-by-maven-analytics
It has many categorical variables with missing values, and I am not sure how to deal with them. Since almost every row has at least one missing value, I can't simply delete the incomplete rows. Imputing with the mean or mode does not seem applicable to this dataset either.
What is the best way to handle these missing values?
For example I tried to impute the variable Multiple.Lines like this:
library(Hmisc)
telecom_customer_churn$Multiple.Lines = impute(telecom_customer_churn$Multiple.Lines, "random")
This works, but when I try to make a bar plot like this:
ggplot(data = telecom_customer_churn) +
  geom_histogram(mapping = aes(x = Multiple.Lines), color = "blue", fill = "lightblue")
It shows me the error:
Error: Discrete value supplied to continuous scale
This seems weird to me, because all the missing values of Multiple.Lines have been replaced by either 'yes' or 'no'.
Upvotes: 0
Views: 118
Reputation: 590
Here's a solution to your problem:
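The problem is that impute() changes the class of the column to "impute". ggplot2 does not know which scale to use for that class and falls back to a continuous one, which then fails on the 'yes'/'no' values. Applying as.factor()
forces the data to be treated as categorical again. Note also that geom_bar() is the right geom for counting a categorical variable; geom_histogram() bins a continuous one.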
library(Hmisc)    # provides impute()
library(ggplot2)
# Toy data standing in for the real column
telecom_customer_churn = data.frame(Multiple.Lines = c(rep("yes", 5), NA, rep("no", 7), rep(NA, 2), rep("maybe", 3)))
telecom_customer_churn$Multiple.Lines = impute(telecom_customer_churn$Multiple.Lines, "random")
# as.factor() makes ggplot2 treat the imputed column as discrete
ggplot(telecom_customer_churn, aes(x = as.factor(Multiple.Lines))) +
  geom_bar(color = "blue", fill = "lightblue")
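If you want to check this yourself, you can inspect the class of the imputed vector before plotting. This is only a minimal sketch on a made-up six-element vector, not the Kaggle data, and the names x, x_imp and x_fac are just for illustration. It assumes Hmisc is installed.

```r
library(Hmisc)

set.seed(1)  # "random" imputation samples from the observed values
x <- c("yes", "no", NA, "yes", NA, "no")  # illustrative data only

x_imp <- impute(x, "random")

class(x_imp)            # contains "impute" rather than plain character/factor
sum(is.imputed(x_imp))  # how many entries were filled in

# Drop the "impute" class by going through character, then make a factor
x_fac <- factor(as.character(x_imp))
class(x_fac)
levels(x_fac)
```

Storing the converted factor back into the data frame, instead of wrapping the column in as.factor() inside aes(), should keep later plots and models from hitting the same class issue.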
Upvotes: 1