Alam

Reputation: 33

percentage of missing data in multiple variables in R

I have a data set with missing data, and I have found that 6 variables contain missing values. I want to check the percentage of missing data in each of them. I have used mean(is.na(...)), but I am not sure this is correct, and I expect there is a much simpler way than repeating the same line for every variable, as you can see below.

What is the best code to get the percentage of missing data in multiple variables?

PS. I am hoping it can look like my delete-column code below, which removes the columns in a single line.

--------------------CODE--------------------------------

mean(is.na(TrainDataSet$KF6))
mean(is.na(TrainDataSet$KF9))
mean(is.na(TrainDataSet$KF10))
mean(is.na(TrainDataSet$F1))
mean(is.na(TrainDataSet$T2))
mean(is.na(TrainDataSet$ST7))

# Delete columns with missing data from TrainDataSet (selected by column position)

TrainDataSet <- TrainDataSet[, -c(11, 14, 15, 21, 28, 54)]

I am getting results for all the columns. Please provide a solution for only the 6 columns above (KF6, KF9, KF10, F1, T2, ST7).

Upvotes: 0

Views: 1076

Answers (1)

Jon Spring

Reputation: 66415

colMeans(is.na(airquality))

     Ozone    Solar.R       Wind       Temp      Month        Day 
0.24183007 0.04575163 0.00000000 0.00000000 0.00000000 0.00000000

If you just want certain columns, you could use:

colMeans(is.na(airquality[c("Solar.R", "Wind")]))
#colMeans(is.na(airquality[, 2:3]))   # equivalent by column position
   Solar.R       Wind 
0.04575163 0.00000000 
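Applied to the six columns named in the question, the same idea might look like the sketch below. The data frame here is a small made-up stand-in for TrainDataSet (only the column names come from the question; the values and the extra ID column are assumptions for illustration). Note that colMeans(is.na(...)) returns fractions between 0 and 1, so multiply by 100 if you want percentages.

```r
# Toy stand-in for TrainDataSet: hypothetical values, real column names
TrainDataSet <- data.frame(
  ID   = 1:4,
  KF6  = c(1, NA, 3, 4),
  KF9  = c(NA, NA, 3, 4),
  KF10 = c(1, 2, 3, 4),
  F1   = c(NA, 2, 3, 4),
  T2   = c(1, 2, NA, NA),
  ST7  = c(NA, NA, NA, 4)
)

# The columns of interest, selected by name rather than by position
cols <- c("KF6", "KF9", "KF10", "F1", "T2", "ST7")

# Fraction of missing values in each selected column
missing_frac <- colMeans(is.na(TrainDataSet[cols]))

# The same result expressed as a percentage
missing_pct <- 100 * missing_frac
missing_pct
```

Selecting by name keeps the result limited to those six columns and does not break if the column order changes later.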

Alternatively, with dplyr you could use summarize(across(...)) to apply the same calculation to each specified column:

library(dplyr)
airquality %>% summarize(across(c(Solar.R, Wind), ~mean(is.na(.x))))

     Solar.R Wind
1 0.04575163    0
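Since the question also removes the affected columns by position, here is a hedged base R sketch of one way to tie the two steps together: compute the missing fraction for the chosen columns, then drop by name those that have any missing values. The data frame and the cutoff of 0 are assumptions for illustration, not something taken from the original data.

```r
# Hypothetical toy data; only the column names echo the question
TrainDataSet <- data.frame(
  ID  = 1:3,
  KF6 = c(1, NA, 3),   # one missing value
  F1  = c(NA, NA, NA), # entirely missing
  T2  = c(1, 2, 3)     # complete
)

# Columns to check for missing data
cols <- c("KF6", "F1", "T2")

# Fraction of missing values per checked column
missing_frac <- colMeans(is.na(TrainDataSet[cols]))

# Names of checked columns whose missing fraction exceeds the cutoff
# (cutoff of 0 is an assumption: drop a column if it has any NA at all)
to_drop <- names(missing_frac)[missing_frac > 0]

# Drop those columns by name, leaving all other columns untouched
TrainDataSet_clean <- TrainDataSet[, !(names(TrainDataSet) %in% to_drop), drop = FALSE]
names(TrainDataSet_clean)
```

Dropping by name avoids having to look up positions such as 11, 14, 15 by hand, and raising the cutoff (for example to 0.5) would keep columns that are only lightly affected.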

Upvotes: 1
