Using is.na in R to get Column Names that Contain NA Values

Question

Given the example data set below:

df <- as.data.frame(matrix( c(1, 2, 3, NA, 5, NA, 
                              7, NA, 9, 10, NA, NA), nrow=2, ncol=6))

names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

print(df)

  varA varB varC varD varE varF
1    1    3    5    7    9   NA
2    2   NA   NA   NA   10   NA

I'd like to be able to use kmeans(...) on data sets without having to manually check for or delete variables that contain NA anywhere within the variable. Although I'm asking about kmeans(...) right now, I'll be using a similar process for other things, so a kmeans(...)-specific answer won't fully answer my question.

The manual version of what I'd like is:

kmeans_model <- kmeans(df[, -c(2:4, 6)], 10)

And the pseudo-code would be:

kmeans_model <- kmeans(df[, -c(colnames(is.na(df)))], 10)

Also, I don't want to delete the data from df. Thanks in advance.

(Obviously kmeans(...) wouldn't work on this example data set, but I can't recreate the real data set.)

talat · Accepted Answer

Here are two options without sapply:

kmeans_model <- kmeans(df[, !colSums(is.na(df))], 10)

Or

kmeans_model <- kmeans(df[, colSums(is.na(df)) == 0], 10)

Explanation:

colSums(is.na(df)) counts the number of NAs per column, resulting in:

colSums(is.na(df))
#varA varB varC varD varE varF 
#   0    1    1    1    0    2

And then

colSums(is.na(df)) == 0     # converts to logical TRUE/FALSE
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE

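is the same as the following, because ! coerces a count of 0 to FALSE and any non-zero count to TRUE before negating it: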

!colSums(is.na(df))
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE

Either logical vector can be used to select only those columns where the value is TRUE, i.e. the columns that contain no NA. Because the subset is a copy, df itself is not modified.
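
Since the title asks for the column names themselves, the same logical vector can also be used to list them. The following is a minimal sketch using the example df from the question. The names has_na, na_cols and clean are illustrative only and not part of the original answer, and drop = FALSE is added as a precaution so that the result stays a data frame even if only one column survives:

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# TRUE for every column that contains at least one NA
has_na <- colSums(is.na(df)) > 0

# Names of the columns that contain NA (hypothetical variable name)
na_cols <- names(df)[has_na]

# Copy of df without those columns; df itself is left untouched
clean <- df[, !has_na, drop = FALSE]

print(na_cols)
print(names(clean))
```

With the example data, na_cols should hold varB, varC, varD and varF, and clean should keep only varA and varE. The clean object could then be passed to kmeans(...) or to any other function that cannot handle NA.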
