Reputation: 1925
I have ~150 logical variables and want to remove trivial ones (all data values are FALSE). How can I do that with dplyr?
What I'm doing (maybe I don't need that at all, still learning). I have data where each data point is categorized. The trick is that the same point may have multiple categories, so it's not a factor:
y | x | domain
------------------
0 | 1 | dogs,animals
1 | 5 | cats,animals
And I'd like to build a prediction model for y. I converted this structure (outside R) into logical columns:
y | x | d_dogs | d_cats | d_animals
-----------------------------------
0 | 1 | T | F | T
1 | 5 | F | T | T
and am building a regression model on that. (Categories are nested on themselves, but that's another topic).
But some categories have too few data points (all, or almost all, values are F), so I want to remove them. Without dplyr I do:
keep.columns <- sapply(colnames(data), function(n) {
  col <- data[, n]
  # keep non-logical columns, and logical columns with more than one TRUE
  !is.logical(col) || sum(col) > 1
})
data[, keep.columns]
But I'm curious whether there is an easier way.
Upvotes: 1
Views: 90
Reputation: 1693
To find columns with trivial (identical) values, you could try this:
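library(dplyr)
df <- data.frame(a = c(1,1,1,1,1), b = c(1,2,3,4,5), c = c("a","a","a","a","a"))
# summarise_each() and funs() are deprecated in recent dplyr versions;
# summarise(across(everything(), n_distinct)) is the current equivalent
df %>%
  summarise_each(funs(n_distinct))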
Output:
a b c
1 1 5 1
i.e., cols "a" and "c" have only 1 unique/distinct value
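The summary above only identifies the constant columns; it does not remove them. As a minimal base R sketch (an illustration, not part of the original answer), the same counts can be used to drop them. The names n_unique and varying are hypothetical, and note that "constant" is a slightly different criterion from the "at most one TRUE" rule in the question.

```r
# Example data from above: "a" and "c" are constant, "b" varies.
df <- data.frame(a = c(1, 1, 1, 1, 1),
                 b = c(1, 2, 3, 4, 5),
                 c = c("a", "a", "a", "a", "a"))

# Count distinct values per column (base R counterpart of n_distinct).
n_unique <- vapply(df, function(col) length(unique(col)), integer(1))

# Keep only columns with more than one distinct value.
# drop = FALSE keeps the result a data frame even if one column remains.
varying <- df[, n_unique > 1, drop = FALSE]
names(varying)
```

With this example only column "b" should remain.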
Upvotes: 1
Reputation: 15927
You are basically doing the right thing, but a small simplification is possible:
data[ , !sapply(data, is.logical) | (colSums(data) > 1)]
Let me show how it works with an example data set:
data <- data.frame(x = 1:6,
                   d_dogs = rep(FALSE, 6),
                   d_cats = rep(c(FALSE, TRUE), 3),
                   d_horses = rep(TRUE, 6),
                   d_animals = c(rep(FALSE, 5), TRUE))
data
## x d_dogs d_cats d_horses d_animals
## 1 1 FALSE FALSE TRUE FALSE
## 2 2 FALSE TRUE TRUE FALSE
## 3 3 FALSE FALSE TRUE FALSE
## 4 4 FALSE TRUE TRUE FALSE
## 5 5 FALSE FALSE TRUE FALSE
## 6 6 FALSE TRUE TRUE TRUE
Instead of using sapply to apply your "complicated" function, you can just use it to get the columns that are not logical as follows:
!sapply(data, is.logical)
## x d_dogs d_cats d_horses d_animals
## TRUE FALSE FALSE FALSE FALSE
And to get the number of TRUE values per column, you can use colSums:
colSums(data)
## x d_dogs d_cats d_horses d_animals
## 21 0 3 6 1
Putting everything together:
data[ , !sapply(data, is.logical) | (colSums(data) > 1)]
## x d_cats d_horses
## 1 1 FALSE TRUE
## 2 2 TRUE TRUE
## 3 3 FALSE TRUE
## 4 4 TRUE TRUE
## 5 5 FALSE TRUE
## 6 6 TRUE TRUE
You could use dplyr, but I don't think that it really offers much of a simplification here. This would work:
select(data, which(!sapply(data, is.logical) | (colSums(data) > 1)))
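For comparison, here is a sketch of how the same selection might look with a predicate inside select(). This is not from the original answer, and it assumes dplyr 1.0.0 or later, where the where() selection helper is available. The name kept is hypothetical.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, which supports where() in select()

data <- data.frame(x = 1:6,
                   d_dogs = rep(FALSE, 6),
                   d_cats = rep(c(FALSE, TRUE), 3),
                   d_horses = rep(TRUE, 6),
                   d_animals = c(rep(FALSE, 5), TRUE))

# Keep a column if it is not logical, or if it has more than one TRUE.
# || short-circuits, so sum() is only evaluated for logical columns.
kept <- select(data, where(function(col) !is.logical(col) || sum(col) > 1))
names(kept)
```

On the example data this should keep x, d_cats and d_horses. Because the predicate is evaluated column by column, it does not call colSums on the whole data frame, so non-numeric columns should not be a problem.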
Upvotes: 2
Reputation: 887851
We could use Filter:
Filter(function(x) !is.logical(x) | sum(x)>1, data)
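One possible refinement, offered as a sketch rather than as part of the answer above: with the element-wise |, both sides are evaluated for every column, so sum() would be called on character columns too and fail there. Using || (as in the question) short-circuits, and na.rm = TRUE guards against missing values. The label column and the NA below are hypothetical additions for illustration.

```r
data <- data.frame(x = 1:6,
                   label = letters[1:6],   # hypothetical character column
                   d_dogs = rep(FALSE, 6),
                   d_cats = c(FALSE, TRUE, NA, TRUE, FALSE, TRUE),  # hypothetical NA
                   stringsAsFactors = FALSE)

# Keep non-logical columns untouched; keep logical columns only if they
# contain more than one TRUE, ignoring missing values.
kept <- Filter(function(col) !is.logical(col) || sum(col, na.rm = TRUE) > 1, data)
names(kept)
```

Here x, label and d_cats should be kept, and d_dogs dropped.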
Upvotes: 4