gatsky

Reputation: 1285

Using functions in dplyr incorporating values in other rows

Apologies in advance for the clunky code. I have a data frame similar to the following:

df <- data.frame(id = c(rep(1, 5), 2, 2),
                 consequence = c("A", "A", "B", "B", "C", "C", "C"))

  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
5  1           C
6  2           C
7  2           C

I would like to perform the following filtering action:

if a group (by id) contains consequence A or B, keep those rows and remove the rows with consequence C. If a group contains only C, or consists of a single row, keep it unchanged.

I have tried to do this in dplyr with a custom function, but the function is called once on the whole data frame, so the per-group logic never runs and every row with consequence C is removed:

# filtering function:
consequence_select <- function(x) {
  if (n_distinct(x$consequence) > 1) {
    if (any(unique(x$consequence) %in% c("A", "B"))) {
      x %>% filter(consequence %in% c("A", "B"))
    } else {
      x
    }
  } else {
    x
  }
}


df %>%
  group_by(id) %>%
  consequence_select()

  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
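As far as I can tell, the problem is that group_by() only attaches grouping metadata: piping into a plain function still calls it once with the whole data frame, so the if() conditions are evaluated across every group at once. A quick check (reusing df from above):

```r
library(dplyr)

df <- data.frame(id = c(rep(1, 5), 2, 2),
                 consequence = c("A", "A", "B", "B", "C", "C", "C"))

# group_by() only tags the data with grouping metadata; a plain function
# in the pipe still receives the entire data frame in a single call, so
# the if() conditions in consequence_select() are evaluated globally:
grouped <- df %>% group_by(id)
n_distinct(grouped$consequence)  # 3 -- all groups seen at once
```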

I was able to do this correctly with plyr:

ddply(df, .(id), consequence_select)

  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
5  2           C
6  2           C

Upvotes: 3

Views: 229

Answers (3)

talat

Reputation: 70266

You could optimize your code by doing the filtering inside filter() rather than inside do(), since filter() is the specialized dplyr verb for exactly this task. I created two such functions and benchmarked them against the existing answers. Which one you want to use depends on your requirements; for the sample data, they both produce the same result. I also created a slightly larger sample data set for the benchmark, as below.

# sample data
df <- data.frame(id = sample(100, 1000, replace = T), 
                 consequence = sample(LETTERS[1:3], 1000, replace = TRUE, prob = c(0.2, 0.2, 0.6)))

# the existing custom function
consequence_select <- function(x) {
  if (n_distinct(x$consequence) > 1) {
    if (any(unique(x$consequence) %in% c("A", "B"))) {
      x %>% filter(consequence %in% c("A", "B"))
    } else {
      x
    }
  } else {
    x
  }
}

# eipi10's answer
f1 <- function() {
  df %>%
    group_by(id) %>%
    do(consequence_select(.))
}

# jazzurro's answer
f2 <- function() {
  df %>%
    group_by(id) %>%
    do(if (all(.$consequence == "C")) {.} else {.[-which(.$consequence == "C"), ]})
}

# my answer 1
f3a <- function() {
  df %>% 
    group_by(id) %>% 
    filter((consequence != "C" & n_distinct(consequence) > 1L) | all(consequence == "C") )
}

# my answer 2
f3b <- function() {
  df %>% 
    group_by(id) %>% 
    filter((consequence %in% c("A", "B") & n_distinct(consequence) > 1L) | all(consequence == "C"))
}

library(microbenchmark)

microbenchmark(f1(), f2(), f3a(), f3b(), unit = "relative")

Unit: relative
 expr       min        lq    median        uq       max neval
f1()  11.243524 11.092915 10.956129 10.717519  8.859949   100
f2()   6.603549  6.663674  6.653424  6.566012 10.956784   100
f3a()  1.279952  1.294679  1.291719  1.294606  1.165322   100
f3b()  1.000000  1.000000  1.000000  1.000000  1.000000   100

all.equal(f1(), f3a())
#[1] TRUE
all.equal(f1(), f3b())
#[1] TRUE

As you can see, even a modest increase in data size already reveals a more than 10-fold speed difference between the functions.
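To make the "depends on your requirements" point concrete: the two conditions diverge as soon as a group contains a value other than A, B or C. With a made-up value "D" (not in the original data):

```r
library(dplyr)

# hypothetical group containing "C" and an unexpected value "D"
df2 <- data.frame(id = c(1, 1), consequence = c("C", "D"))

# f3a's condition keeps any non-"C" value in a mixed group
a <- df2 %>%
  group_by(id) %>%
  filter((consequence != "C" & n_distinct(consequence) > 1L) |
           all(consequence == "C"))

# f3b's condition keeps only "A"/"B", so this group is dropped entirely
b <- df2 %>%
  group_by(id) %>%
  filter((consequence %in% c("A", "B") & n_distinct(consequence) > 1L) |
           all(consequence == "C"))

nrow(a)  # 1 -- the "D" row survives
nrow(b)  # 0
```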

Upvotes: 4

jazzurro

Reputation: 23574

You can achieve what your function does like this, using do(). foo is your data.

foo %>%
    group_by(id) %>%
    do(if(all(.$consequence == "C")) {.} else{.[-which(.$consequence == "C"), ]})

#  id consequence
#1  1           A
#2  1           A
#3  1           B
#4  1           B
#5  2           C
#6  2           C

Upvotes: 3

eipi10

Reputation: 93811

With dplyr you need to wrap the function in do:

df %>%
  group_by(id) %>%
  do(consequence_select(.))

The . is a "pronoun" that refers to the subset of df belonging to the current group.
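As a minimal illustration of what . stands for inside do() (using head() instead of the custom function, purely for demonstration):

```r
library(dplyr)

df <- data.frame(id = c(rep(1, 5), 2, 2),
                 consequence = c("A", "A", "B", "B", "C", "C", "C"))

# Inside do(), . is the subset of df for the current group, so head()
# runs once per id rather than once on the whole data frame:
res <- df %>%
  group_by(id) %>%
  do(head(., 1))
res  # 2 rows: the first row of each id group
```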

Upvotes: 4
