Frank B.

Reputation: 1873

Using is.na in R to get Column Names that Contain NA Values

Given the example data set below:

df <- as.data.frame(matrix( c(1, 2, 3, NA, 5, NA, 
                              7, NA, 9, 10, NA, NA), nrow=2, ncol=6))

names(df) <- c(  "varA", "varB", "varC", "varD", "varE", "varF")

print(df)

  varA varB varC varD varE varF
1    1    3    5    7    9   NA
2    2   NA   NA   NA   10   NA

I'd like to be able to run kmeans(...) on data sets without having to manually check for, or delete, variables that contain an NA anywhere. Although I'm asking about kmeans(...) right now, I'll be using a similar process for other things, so a kmeans(...)-specific answer won't fully answer my question.

The manual version of what I'd like is:

kmeans_model <- kmeans(df[, -c(2:4, 6)], 10) 

And the pseudo-code would be:

kmeans_model <- kmeans(df[, -c(colnames(is.na(df)))], 10) 

Also, I don't want to delete the data from df. Thanks in advance.

(Obviously kmeans(...) wouldn't work on this example data set, but I can't recreate the real data set here.)

Upvotes: 2

Views: 12463

Answers (2)

Saurabh Jain

Reputation: 1712

This is the generic approach that I use for listing column names and their count of NAs:

sort(colSums(is.na(df)), decreasing = TRUE)

If you want to use sapply, you can refer to this snippet as well (shown here on a data frame named flights):

flights_NA_cols <- sapply(flights, function(x) sum(is.na(x))) 
flights_NA_cols[flights_NA_cols>0]
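
To connect this back to the question, here is a minimal sketch that applies the same sapply idea to the example df and then pulls out just the column names. The names na_counts, na_cols and clean_df are only illustrative and not part of the original snippet.

```r
# Recreate the example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# Count the NAs in each column (named integer vector)
na_counts <- sapply(df, function(x) sum(is.na(x)))

# Names of the columns that contain at least one NA
na_cols <- names(na_counts[na_counts > 0])
na_cols
# [1] "varB" "varC" "varD" "varF"

# Subset by name without modifying df; drop = FALSE keeps a data frame
clean_df <- df[, !(names(df) %in% na_cols), drop = FALSE]
names(clean_df)
# [1] "varA" "varE"
```

Selecting by name rather than by position means the result can be reused for steps other than kmeans(...).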

Upvotes: 2

talat

Reputation: 70266

Here are two options without sapply:

kmeans_model <- kmeans(df[, !colSums(is.na(df))], 10) 

Or

kmeans_model <- kmeans(df[, colSums(is.na(df)) == 0], 10) 

Explanation:

colSums(is.na(df)) counts the number of NAs per column, resulting in:

colSums(is.na(df))
#varA varB varC varD varE varF 
#   0    1    1    1    0    2 

And then

colSums(is.na(df)) == 0     # converts to logical TRUE/FALSE
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE 

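is the same as the following, because ! coerces the counts to logical (0 becomes FALSE, any other number TRUE) before negating them: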

!colSums(is.na(df))
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE 

Both expressions can be used to keep only those columns where the logical value is TRUE. The original df is not modified.
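
Since the question asks for something reusable beyond kmeans(...), the subsetting could be wrapped in a small function. This is only a sketch: drop_na_cols is a hypothetical helper name, not a base R function, and drop = FALSE is added so that a data frame is returned even if only one column survives.

```r
# Recreate the example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# Hypothetical helper: return a copy of `data` without any column containing NA
drop_na_cols <- function(data) {
  keep <- colSums(is.na(data)) == 0   # TRUE for columns with zero NAs
  data[, keep, drop = FALSE]          # drop = FALSE always returns a data frame
}

complete_cols <- drop_na_cols(df)
names(complete_cols)
# [1] "varA" "varE"

# The names of the excluded columns, if they are needed
names(df)[colSums(is.na(df)) > 0]
# [1] "varB" "varC" "varD" "varF"

# df itself still has all six columns
ncol(df)
# [1] 6
```

The result can then be passed to kmeans(...) or any other function that cannot handle NA values.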

Upvotes: 6
