Dzmitry Lazerka

Reputation: 1925

dplyr: Create columns based on their data

I have ~150 logical variables and want to remove trivial ones (all data values are FALSE). How can I do that with dplyr?

Here is what I'm doing (maybe I don't need it at all, I'm still learning). I have data where each data point is categorized. The trick is that the same point may have multiple categories, so it's not a factor:

y | x | domain
------------------
0 | 1 | dogs,animals
1 | 5 | cats,animals

And I'd like to build a prediction model for y. I converted this structure (outside R) into logical columns:

y | x | d_dogs | d_cats | d_animals
-----------------------------------
0 | 1 |    T   |    F   |    T
1 | 5 |    F   |    T   |    T

and am building a regression model on that. (Categories are nested within each other, but that's another topic.)
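
For reference, here is roughly how that conversion could be done in R itself (just a sketch; I actually did it outside R):

df <- data.frame(y = c(0, 1), x = c(1, 5),
                 domain = c("dogs,animals", "cats,animals"),
                 stringsAsFactors = FALSE)

# split the comma-separated categories and build one logical column per category
cats <- strsplit(df$domain, ",")
for (lvl in unique(unlist(cats))) {
    df[[paste0("d_", lvl)]] <- vapply(cats, function(x) lvl %in% x, logical(1))
}
df$domain <- NULL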

But some categories have too few data points (all or almost all values are F), so I want to remove them. Without dplyr I do:

# keep non-logical columns, and logical columns with more than one TRUE
keep.columns <- sapply(colnames(data), function(n) {
    col <- data[, n]
    !is.logical(col) || sum(col) > 1
})
data[, keep.columns]

But curious if I can do that easier.

Upvotes: 1

Views: 90

Answers (3)

Sebastian Sauer

Reputation: 1693

To find columns with trivial (identical) values, you could try this:

library(dplyr)

df <- data.frame(a = c(1,1,1,1,1), b = c(1,2,3,4,5), c = c("a","a","a","a","a"))

df %>% 
  summarise_each(funs(n_distinct))

Output:

  a b c
1 1 5 1

i.e., columns "a" and "c" have only one unique/distinct value, so those are the trivial ones to drop.
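
summarise_each() and funs() have since been deprecated in dplyr; assuming dplyr >= 1.0, the same idea can be written with across(), and select(where(...)) drops the trivial columns in one step (a sketch):

library(dplyr)

# count distinct values per column
df %>% summarise(across(everything(), n_distinct))

# keep only columns with more than one distinct value
df %>% select(where(~ n_distinct(.x) > 1))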

Upvotes: 1

Stibu

Reputation: 15927

You are basically doing the right thing, but a small simplification is possible:

data[ , !sapply(data, is.logical) | (colSums(data) > 1)]

Let me show how it works with an example data set:

data <- data.frame(x = 1:6,
                   d_dogs = rep(FALSE, 6),
                   d_cats = rep(c(FALSE, TRUE), 3),
                   d_horses = rep(TRUE, 6),
                   d_animals = c(rep(FALSE, 5), TRUE))
data
##   x d_dogs d_cats d_horses d_animals
## 1 1  FALSE  FALSE     TRUE     FALSE
## 2 2  FALSE   TRUE     TRUE     FALSE
## 3 3  FALSE  FALSE     TRUE     FALSE
## 4 4  FALSE   TRUE     TRUE     FALSE
## 5 5  FALSE  FALSE     TRUE     FALSE
## 6 6  FALSE   TRUE     TRUE      TRUE

Instead of using sapply to apply your "complicated" function, you can just use it to get the columns that are not logical as follows:

!sapply(data, is.logical)
##     x    d_dogs    d_cats  d_horses d_animals 
##  TRUE     FALSE     FALSE     FALSE     FALSE 

And to get the number of TRUE per column, you can use colSums:

colSums(data)
##         x    d_dogs    d_cats  d_horses d_animals 
##        21         0         3         6         1 

Putting everything together:

data[ , !sapply(data, is.logical) | (colSums(data) > 1)]
##   x d_cats d_horses
## 1 1  FALSE     TRUE
## 2 2   TRUE     TRUE
## 3 3  FALSE     TRUE
## 4 4   TRUE     TRUE
## 5 5  FALSE     TRUE
## 6 6   TRUE     TRUE

You could use dplyr, but I don't think it really offers a simplification here. This would work:

select(data, which(!sapply(data, is.logical) | (colSums(data) > 1)))
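
With a more recent dplyr (>= 1.0 is assumed here), the where() helper lets you express the same condition directly inside select(); a sketch:

library(dplyr)

# keep non-logical columns, and logical columns with more than one TRUE
data %>% select(where(~ !is.logical(.x) || sum(.x) > 1))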

Upvotes: 2

akrun

Reputation: 887851

We could use Filter

Filter(function(x) !is.logical(x) | sum(x) > 1, data)
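
Filter() treats a data frame as a list of its columns and keeps those for which the predicate returns TRUE. On the example data from the previous answer (assuming that data set), it gives the same result:

Filter(function(x) !is.logical(x) | sum(x) > 1, data)
##   x d_cats d_horses
## 1 1  FALSE     TRUE
## 2 2   TRUE     TRUE
## 3 3  FALSE     TRUE
## 4 4   TRUE     TRUE
## 5 5  FALSE     TRUE
## 6 6   TRUE     TRUE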

Upvotes: 4
