Reputation: 1229
When working with data frames, it is common to need a subset. However, use of the subset function is discouraged. The trouble with the following code is that the data frame name is mentioned twice. If you copy, paste and adapt the code, it is easy to forget to change the second mention of adf, which can be a disaster.
adf=data.frame(a=1:10,b=11:20)
print(adf[which(adf$a>5),]) ##alas, adf mentioned twice
print(with(adf,adf[{a>5},])) ##alas, adf mentioned twice
print(subset(adf,a>5)) ##alas, not supposed to use subset
Is there a way to write the above without mentioning adf twice? Unfortunately, with with() or within(), I cannot seem to access adf as a whole.
The subset(...) function would make this easy, but its documentation warns against using it in programs:
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
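To illustrate the kind of surprise that warning refers to, here is a minimal sketch. The global variable b is made up for this example and only serves to clash with the column of the same name:

```r
adf <- data.frame(a = 1:10, b = 11:20)
b <- 3  # hypothetical threshold that happens to share its name with a column

# subset() looks up b inside the data frame first,
# so this compares column a with column b, not with the threshold
n_subset <- nrow(subset(adf, a > b))

# Standard subsetting uses the b that is visible in the calling environment
n_bracket <- nrow(adf[adf$a > b, ])

print(c(subset = n_subset, bracket = n_bracket))
```

The two calls look equivalent but select different rows, because the first one silently picks up the column instead of the variable.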
Upvotes: 7
Views: 1005
Reputation: 1229
After some thought, I wrote a super simple function called given:
given=function(.,...) { with(.,...) }
This way, I don't have to repeat the name of the data.frame. I also found it to be roughly 13 to 14 times faster than dplyr::filter() on this 10-row data frame (comparing the median and mean timings). See below:
adf=data.frame(a=1:10,b=11:20)
given=function(.,...) { with(.,...) }
with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
library(microbenchmark) ##needed for the timings below
microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
Console output, with timings from microbenchmark:
> adf=data.frame(a=1:10,b=11:20)
> given=function(.,...) { with(.,...) }
> with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
a b
6 6 16
7 7 17
> given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
a b
6 6 16
7 7 17
> dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
a b
1 6 16
2 7 17
> microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
Unit: microseconds
expr min lq mean median uq max neval
with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507 1000
> microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
Unit: microseconds
expr min lq mean median uq max neval
given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556 1000
> microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
Unit: microseconds
expr min lq mean median uq max neval
dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521 1000
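As a possible variation, the three expressions can be timed in a single microbenchmark() call so that the results land in one table. This is only a sketch: it assumes the microbenchmark and dplyr packages are installed, and the labels with_adf, given_adf and filter_adf are names chosen for this example.

```r
library(microbenchmark)  # assumes the package is installed

adf <- data.frame(a = 1:10, b = 11:20)
given <- function(., ...) with(., ...)

# One call, three labelled expressions, same number of repetitions each
bm <- microbenchmark(
  with_adf   = with(adf, adf[a > 5 & b < 18, ]),
  given_adf  = given(adf, .[a > 5 & b < 18, ]),
  filter_adf = dplyr::filter(adf, a > 5, b < 18),  # assumes dplyr is installed
  times = 1000
)
print(bm)
```

Timings will differ between machines and package versions, so the numbers above should be read as one run rather than a general result.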
I noticed that given() is actually a tad faster than with() in this run. I put that down to the shorter variable name, though a gap this small could also be measurement noise.
The neat thing about given is that you can do some things inline without assignment:
given(data.frame(a=1:10,b=11:20),.[a>5 & b<18,])
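A short sketch of why this works, as far as I understand it: with() evaluates the expression with the data frame's columns in scope and falls back to the environment it was called from, which here is the body of given, where . is bound to the whole data frame. The check below compares the result with plain bracket subsetting; res_given and res_base are names made up for the example.

```r
adf <- data.frame(a = 1:10, b = 11:20)
given <- function(., ...) with(., ...)

# a and b come from the columns; . comes from given's own argument
res_given <- given(adf, .[a > 5 & b < 18, ])

# The same rows, written with the data frame name repeated
res_base <- adf[adf$a > 5 & adf$b < 18, ]

print(identical(res_given, res_base))
```

One caveat under this reading: a data frame with a column literally named . would shadow the whole-frame binding inside the expression.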
Upvotes: 1
Reputation: 4444
As @akrun states, I would use dplyr's filter function:
require("dplyr")
new <- filter(adf, a > 5)
new
In practice, I don't find the subsetting notation ([ ]) problematic, because if I copy a block of code, I use find and replace within RStudio to replace all mentions of the dataframe in the selected code. I use dplyr for a different reason: the notation and syntax are easier to follow for new users (and myself!), and the dplyr functions 'do one thing well.'
Upvotes: 1