Reputation: 1285
Apologies in advance for the clunky code. I have a data frame similar to the following:
df <- data.frame(c(rep_len(1,5), 2, 2), c("A", "A", "B", "B", "C", "C", "C"))
names(df) <- c("id", "consequence")
  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
5  1           C
6  2           C
7  2           C
I would like to perform the following filtering action:
If a group (by id) contains consequence A or B, keep those rows and remove the rows with consequence C. If a group contains only C, or only a single row, keep all of its rows.
I have tried to do this in dplyr with a custom function, but the filter ends up being applied to all rows at once, so every row with consequence C is removed, including those for id 2:
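library(dplyr)

# filtering function: if the data mixes several consequences and at least
# one of them is A or B, keep only the A/B rows; otherwise return it unchanged
consequence_select <- function(x) {
  if (n_distinct(x$consequence) > 1) {
    if (any(unique(x$consequence) %in% c("A", "B"))) {
      x %>%
        filter(consequence %in% c("A", "B"))
    } else {
      return(x)
    }
  } else {
    return(x)
  }
}

df %>%
  group_by(id) %>%
  consequence_select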
  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
I was able to do this correctly with plyr:
ddply(df, .(id), consequence_select)
  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
5  2           C
6  2           C
Upvotes: 3
Views: 229
Reputation: 70266
You could optimize your code by expressing the condition inside filter() rather than inside do(), since filter() is the specialized dplyr function for this task. I created two functions and benchmarked them against the existing answers. Which function you want to use depends on your requirements; for the sample data, they both produce the same result. I also created a slightly larger sample data set for the benchmark, as below.
library(dplyr)

# sample data
df <- data.frame(id = sample(100, 1000, replace = TRUE),
                 consequence = sample(LETTERS[1:3], 1000, replace = TRUE, prob = c(0.2, 0.2, 0.6)))
# the existing custom function
consequence_select <- function(x) {
  if (n_distinct(x$consequence) > 1) {
    if (any(unique(x$consequence) %in% c("A", "B"))) {
      x %>%
        filter(consequence %in% c("A", "B"))
    } else {
      return(x)
    }
  } else {
    return(x)
  }
}

# eipi's answer
f1 <- function() {
  df %>%
    group_by(id) %>%
    do(consequence_select(.))
}

# jazzuro's answer
f2 <- function() {
  df %>%
    group_by(id) %>%
    do(if (all(.$consequence == "C")) {.} else {.[-which(.$consequence == "C"), ]})
}

# my answer 1
f3a <- function() {
  df %>%
    group_by(id) %>%
    filter((consequence != "C" & n_distinct(consequence) > 1L) | all(consequence == "C"))
}

# my answer 2
f3b <- function() {
  df %>%
    group_by(id) %>%
    filter((consequence %in% c("A", "B") & n_distinct(consequence) > 1L) | all(consequence == "C"))
}
library(microbenchmark)
microbenchmark(f1(), f2(), f3a(), f3b(), unit = "relative")
Unit: relative
  expr       min        lq    median        uq       max neval
  f1() 11.243524 11.092915 10.956129 10.717519  8.859949   100
  f2()  6.603549  6.663674  6.653424  6.566012 10.956784   100
 f3a()  1.279952  1.294679  1.291719  1.294606  1.165322   100
 f3b()  1.000000  1.000000  1.000000  1.000000  1.000000   100
all.equal(f1(), f3a())
#[1] TRUE
all.equal(f1(), f3b())
#[1] TRUE
As you can see, even a modest increase in data size reveals a more than tenfold speed difference between the do()-based f1() and the filter()-based f3b().
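One caveat worth checking against your requirements: both filter conditions require more than one distinct consequence in a group before keeping any non-C rows, so a group made up only of A rows (or only of B rows) would be dropped, whereas consequence_select() keeps it. The random sample data is unlikely to contain such a group, which is presumably why the all.equal() checks pass. The following is a minimal sketch on a hypothetical data frame edge; the relaxed condition is my assumption about the intended behaviour, not something that was benchmarked above.

```r
library(dplyr)

# Hypothetical edge-case data: id 1 contains only "A" rows
edge <- data.frame(
  id = c(1, 1, 2, 2, 3),
  consequence = c("A", "A", "A", "C", "C")
)

# Condition as in f3a(): non-C rows are kept only if the group has
# more than one distinct consequence
strict <- edge %>%
  group_by(id) %>%
  filter((consequence != "C" & n_distinct(consequence) > 1L) |
           all(consequence == "C")) %>%
  ungroup()

# Relaxed condition (an assumption about the intent): keep every non-C row,
# and keep C rows only when the whole group is C
relaxed <- edge %>%
  group_by(id) %>%
  filter(consequence != "C" | all(consequence == "C")) %>%
  ungroup()

nrow(strict)   # the rows of id 1 are missing here
nrow(relaxed)  # the rows of id 1 are retained here
```

If single-value groups other than C can occur in your real data, it may be worth comparing both variants against consequence_select() before choosing one.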
Upvotes: 4
Reputation: 23574
You can express your function like this using do(). Here, foo is your data.
foo %>%
  group_by(id) %>%
  do(if (all(.$consequence == "C")) {.} else {.[-which(.$consequence == "C"), ]})
# id consequence
#1 1 A
#2 1 A
#3 1 B
#4 1 B
#5 2 C
#6 2 C
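One thing to watch with the negative which() index: if a group has no C rows at all, which() returns an empty vector, and indexing with a negated empty vector selects nothing, so the whole group would be dropped. The sample data has no such group, so the output above is unaffected. Here is a minimal base R sketch using a hypothetical one-group data frame grp, together with a logical-index variant that should avoid the problem:

```r
# Hypothetical group that contains no "C" rows
grp <- data.frame(id = c(1, 1), consequence = c("A", "B"))

# which() finds nothing, so the negative index is empty and no rows are selected
dropped <- grp[-which(grp$consequence == "C"), ]

# Logical indexing keeps every row whose consequence is not "C"
kept <- grp[grp$consequence != "C", ]

nrow(dropped)
nrow(kept)
```

Inside do(), the same idea would be written as .[.$consequence != "C", ] in the else branch.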
Upvotes: 3
Reputation: 93811
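With dplyr you need to wrap the function in do():
df %>%
  group_by(id) %>%
  do(consequence_select(.))
The . is a "pronoun" that, inside do(), refers to the rows of df belonging to the current group, so consequence_select() is applied to each id separately.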
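As far as I know, more recent dplyr releases mark do() as superseded and offer group_modify() for this kind of per-group call. The sketch below assumes a dplyr version that provides group_modify(); the function body is a condensed form of the question's consequence_select() with the same logic, and result is just an illustrative name.

```r
library(dplyr)

df <- data.frame(
  id = c(rep_len(1, 5), 2, 2),
  consequence = c("A", "A", "B", "B", "C", "C", "C")
)

# Condensed version of the question's filtering function (same logic)
consequence_select <- function(x) {
  if (n_distinct(x$consequence) > 1 &&
      any(x$consequence %in% c("A", "B"))) {
    filter(x, consequence %in% c("A", "B"))
  } else {
    x
  }
}

# group_modify() calls the function once per group; .x holds that group's
# rows without the grouping column, and id is added back to the result
result <- df %>%
  group_by(id) %>%
  group_modify(~ consequence_select(.x)) %>%
  ungroup()

result
```

Like do(), this runs an R function per group, so for large data the filter()-based approach is likely to be the faster choice.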
Upvotes: 4