Reputation: 1873
Given the example data set below:
df <- as.data.frame(matrix( c(1, 2, 3, NA, 5, NA,
7, NA, 9, 10, NA, NA), nrow=2, ncol=6))
names(df) <- c( "varA", "varB", "varC", "varD", "varE", "varF")
print(df)
varA varB varC varD varE varF
1 1 3 5 7 9 NA
2 2 NA NA NA 10 NA
I'd like to be able to use kmeans(...) on data sets without having to manually check for or delete variables that contain NA anywhere within the variable. While I'm asking about kmeans(...) right now, I'll be using a similar process for other things, so a kmeans(...)-specific answer won't fully answer my question.
The manual version of what I'd like is:
kmeans_model <- kmeans(df[, -c(2:4, 6)], 10)
And the pseudo-code would be:
kmeans_model <- kmeans(df[, -c(colnames(is.na(df)))], 10)
Also, I don't want to delete the data from df. Thanks in advance.
(Obviously kmeans(...) wouldn't work on this example data set but I can't recreate the real data set)
Upvotes: 2
Views: 12463
Reputation: 1712
This is the generic approach that I use for listing column names and their count of NAs:
sort(colSums(is.na(df)), decreasing = TRUE)
If you want to use sapply, you can refer to this snippet as well (here applied to a data frame named flights):
flights_NA_cols <- sapply(flights, function(x) sum(is.na(x)))
flights_NA_cols[flights_NA_cols>0]
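To get from those counts to the subset the question asks for, here is a minimal sketch using the question's example df. The names na_counts, na_cols and complete_df are illustrative and do not come from the original posts.

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# Count the NAs in each column with sapply
na_counts <- sapply(df, function(x) sum(is.na(x)))

# Names of the columns that contain at least one NA
na_cols <- names(na_counts)[na_counts > 0]

# Keep only the complete columns; df itself is left unchanged.
# drop = FALSE keeps the result a data frame even if one column remains.
complete_df <- df[, setdiff(names(df), na_cols), drop = FALSE]
```

Because the result is assigned to a new object, nothing is deleted from df.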
Upvotes: 2
Reputation: 70266
Here are two options without sapply:
kmeans_model <- kmeans(df[, !colSums(is.na(df))], 10)
Or
kmeans_model <- kmeans(df[, colSums(is.na(df)) == 0], 10)
colSums(is.na(df)) counts the number of NAs per column, resulting in:
colSums(is.na(df))
#varA varB varC varD varE varF
# 0 1 1 1 0 2
And then
colSums(is.na(df)) == 0 # converts to logical TRUE/FALSE
#varA varB varC varD varE varF
#TRUE FALSE FALSE FALSE TRUE FALSE
is the same as
!colSums(is.na(df))
#varA varB varC varD varE varF
#TRUE FALSE FALSE FALSE TRUE FALSE
Both methods can be used to subset only those columns where the logical value is TRUE.
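Since the question mentions reusing this for functions other than kmeans(...), the same test could be wrapped in a small helper. This is only a sketch: drop_na_cols is a hypothetical name, not a base R function, and it assumes the input is a data frame or matrix.

```r
# Hypothetical helper: returns a copy of `data` without any column
# that contains at least one NA. The input object is not modified.
drop_na_cols <- function(data) {
  keep <- colSums(is.na(data)) == 0
  # drop = FALSE keeps the result two-dimensional even if one column remains
  data[, keep, drop = FALSE]
}

# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

complete_df <- drop_na_cols(df)
names(complete_df)  # "varA" "varE"
```

The result can then be passed to any function that cannot handle NAs, for example kmeans(drop_na_cols(df), 10) on a real data set with enough rows.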
Upvotes: 6