Reputation: 33
I have a data set with missing data in 6 variables, and I want to check the percentage of missing values in each. I have used mean(is.na(...)), but I am not sure this is correct, and I suspect there is a much simpler way than the repetitive code below.
Question: what is the best code to get the percentage of missing data in multiple variables?
PS. I am hoping for something that looks like the delete-column code I have below, which removes the columns.
--------------------CODE--------------------------------
mean(is.na(TrainDataSet$KF6))
mean(is.na(TrainDataSet$KF9))
mean(is.na(TrainDataSet$KF10))
mean(is.na(TrainDataSet$F1))
mean(is.na(TrainDataSet$T2))
mean(is.na(TrainDataSet$ST7))
# Delete columns with missing data from TrainDataSet
TrainDataSet <- TrainDataSet[, -c(11, 14, 15, 21, 28, 54)]
I am getting results for all the columns. Please provide a solution for only the 6 columns above **(KF6, KF9, KF10, F1, T2, ST7)**.
Upvotes: 0
Views: 1076
Reputation: 66415
colMeans(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
0.24183007 0.04575163 0.00000000 0.00000000 0.00000000 0.00000000
If you just want certain columns, you could use:
colMeans(is.na(airquality[c("Solar.R", "Wind")]))
#colMeans(is.na(airquality[, 2:3])) # equivalent by column position
Solar.R Wind
0.04575163 0.00000000
Alternatively, with dplyr you could use summarize(across(...)) to apply your code to every specified column:
library(dplyr)
airquality %>% summarize(across(c(Solar.R, Wind), ~mean(is.na(.x))))
Solar.R Wind
1 0.04575163 0
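Applied to the data in the question, the same idea might look like the sketch below. The toy data frame is a hypothetical stand-in (only the column names KF6, KF9 and F1 come from the question; the ID column and all values are made up), and note that colMeans(is.na(...)) returns a proportion, so it is multiplied by 100 to get a percentage.

```r
# Hypothetical stand-in for TrainDataSet; values are invented for illustration
TrainDataSet <- data.frame(
  ID  = 1:4,
  KF6 = c(1, NA, 3, 4),
  KF9 = c(NA, NA, 3, 4),
  F1  = c(1, 2, 3, 4)
)

# Columns to check (extend with KF10, T2, ST7 on the real data)
cols <- c("KF6", "KF9", "F1")

# Proportion of NA per selected column, scaled to a percentage
pct_missing <- colMeans(is.na(TrainDataSet[cols])) * 100
print(pct_missing)

# Drop the checked columns that contain any missing values, by name
drop_cols <- names(pct_missing)[pct_missing > 0]
TrainDataSet <- TrainDataSet[, !(names(TrainDataSet) %in% drop_cols), drop = FALSE]
```

Selecting by name rather than by position (as in -c(11, 14, ...)) avoids having to keep the index list in sync if the column order ever changes, though the positional form works the same way with colMeans(is.na(TrainDataSet[, c(11, 14, 15, 21, 28, 54)])).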
Upvotes: 1