Chris

Reputation: 1229

In R, how can I subset without subset(), using [ more concisely to prevent typos?

When working with data frames, it is common to need a subset. However, use of the subset() function is discouraged. The trouble with the following code is that the data frame name appears twice. If you copy, paste, and munge code, it is easy to forget to change the second mention of adf, which can be a disaster.

adf=data.frame(a=1:10,b=11:20)
print(adf[which(adf$a>5),])  ##alas, adf mentioned twice
print(with(adf,adf[{a>5},])) ##alas, adf mentioned twice
print(subset(adf,a>5)) ##alas, not supposed to use subset

Is there a way to write the above without mentioning adf twice? Unfortunately, inside with() or within() I cannot seem to access adf as a whole.

The subset() function would make this easy, but its documentation warns against using it in programs:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
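
As a minimal sketch of what that warning can mean in practice: inside a function, subset() looks up names in the data frame's columns first, so a function argument that shares its name with a column is silently shadowed. The helper names rows_above_subset and rows_above_bracket are hypothetical, made up for this illustration.

```r
adf <- data.frame(a = 1:10, b = 11:20)

# Hypothetical helper: keep rows where column a exceeds the argument b.
# Because subset() evaluates `a > b` inside the data frame first,
# `b` resolves to the COLUMN b (11:20), not to the function argument.
rows_above_subset <- function(df, b) subset(df, a > b)

# Same intent written with `[`: here `b` is the function argument.
rows_above_bracket <- function(df, b) df[df$a > b, , drop = FALSE]

n_subset  <- nrow(rows_above_subset(adf, 5))   # compares a with column b
n_bracket <- nrow(rows_above_bracket(adf, 5))  # compares a with 5
```

The two helpers look equivalent but return different numbers of rows, and neither raises an error, which is why this kind of mistake is easy to miss.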

Upvotes: 7

Views: 1005

Answers (2)

Chris

Reputation: 1229

After some thought, I wrote a super simple function called given:

given=function(.,...) { with(.,...) }

This way, I don't have to repeat the name of the data.frame. In the benchmark below it was also roughly 13 to 14 times faster than dplyr::filter() on this 10-row data frame (median 51.7 vs 674.7 microseconds). See below:

library(microbenchmark) ##needed for the microbenchmark() calls below
adf=data.frame(a=1:10,b=11:20)
given=function(.,...) { with(.,...) }
with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)

Console output, including the microbenchmark timings:

> adf=data.frame(a=1:10,b=11:20)
> given=function(.,...) { with(.,...) }
> with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
  a  b
6 6 16
7 7 17
> given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
  a  b
6 6 16
7 7 17
> dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
  a  b
1 6 16
2 7 17
> microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
Unit: microseconds
                             expr    min     lq     mean median     uq     max neval
 with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507  1000
> microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
Unit: microseconds
                            expr    min     lq     mean median    uq     max neval
 given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556  1000
> microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
Unit: microseconds
                              expr     min       lq     mean   median       uq      max neval
 dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521  1000

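In this run given() even came out a tad faster than with(). Since given() simply calls with(), that difference is most likely measurement noise between separate benchmark runs rather than an effect of the length of the variable name.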

The neat thing about given() is that you can do some things inline, without assigning the data frame first: given(data.frame(a=1:10,b=11:20),.[a>5 & b<18,])
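
A minimal sketch of how given() behaves, assuming the definition above. As far as I can tell, with() evaluates the expression with the data frame's columns in scope and then falls back to the frame of given(), which is where . is bound. One consequence of that lookup order (my reading of with(), not something I benchmarked or tested widely) is that variables local to a function that calls given() are not visible inside the expression. The names cutoff_local and filter_in_function are hypothetical.

```r
given <- function(., ...) { with(., ...) }
adf <- data.frame(a = 1:10, b = 11:20)

# given() should return the same rows as plain bracket subsetting.
same_result <- identical(given(adf, .[a > 5 & b < 18, ]),
                         adf[adf$a > 5 & adf$b < 18, ])

# Hypothetical caller with a local variable: the lookup goes from the
# data frame's columns to given()'s own frame and then to wherever
# given() was defined, so it skips this function's local variables.
filter_in_function <- function() {
  cutoff_local <- 5
  tryCatch(given(adf, .[a > cutoff_local, ]),
           error = function(e) conditionMessage(e))
}
outcome <- filter_in_function()  # expected: an error message, not a data frame
```

So given() seems best suited to interactive use or top-level scripts. Inside functions that rely on local variables, plain df[df$a > cutoff, ] avoids the issue.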

Upvotes: 1

Phil

Reputation: 4444

As @akrun states, I would use dplyr's filter function:

library(dplyr)
new <- filter(adf, a > 5)
new

In practice, I don't find the subsetting notation ([ ]) problematic: if I copy a block of code, I use find and replace within RStudio to replace all mentions of the data frame in the selected code. I prefer dplyr for a different reason: its notation and syntax are easier to follow for new users (and myself!), and the dplyr functions 'do one thing well.'

Upvotes: 1
