James Owers
James Owers

Reputation: 8335

dplyr mutate calling another dataframe

I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can acheive this in a few different ways, but would like to know how to do this 'properly'.

Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.

set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))

The best I've got so far with dplyr is the following (but this loses the identity variables):

df1 %>% 
    rowwise() %>%
    do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
    mutate(n=nrow(count))

output:

Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 × 2
                  count     n
                 <list> <int>
1 <data.frame [17 × 1]>    17
2 <data.frame [18 × 1]>    18
3 <data.frame [10 × 1]>    10

I was expecting to be able to do this:

df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>% 
    group_by(id) %>% 
    mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))

but this returns the error:

Error: comparison (6) is possible only for atomic and list types

What is the dplyr way of doing this?

Upvotes: 9

Views: 10664

Answers (2)

aichao
aichao

Reputation: 7435

Another slightly different approach using dplyr:

result <- df1 %>% group_by(id) %>% 
                  summarise(count = length(which(df2$time > start_time &
                                                 df2$time < (start_time+15))))

print(result)
### A tibble: 3 x 2
##      id count
##  <fctr> <int>
##1      A    17
##2      B    18
##3      C    10

I believe you can use length and which to count the number of occurrences for which your condition is true for each id in df1. Then, group by id and use this to summarise.


If there are possibly more that one start_time per id, then you can use the same function but rowwise and with mutate:

result <- df1 %>% rowwise() %>% 
                  mutate(count = length(which(df2$time > start_time & 
                                              df2$time < (start_time+15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
##      id start_time count
##  <fctr>      <dbl> <int>
##1      A         10    17
##2      B         20    18
##3      C         30    10

Upvotes: 2

akrun
akrun

Reputation: 887128

Here is one option with data.table where we can use the non-equi joins

library(data.table)#1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew), 
           by = .EACHI][, c('id', 'N'), with = FALSE]
#   id  N
#1:  A 17
#2:  B 18
#3:  C 10

which gives the same count as in the OP's base R method

sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10

If we need the 'id' variable also as output in dplyr, we can modify the OP's code

df1 %>%
    rowwise() %>% 
    do(data.frame(., count = filter(df2, time>.$start_time,
                                 time < (.$start_time + 15)))) %>% 
    group_by(id) %>% 
    summarise(n = n())
#      id     n
#  <fctr> <int>
#1      A    17
#2      B    18
#3      C    10

Or another option is map from purrr with dplyr

library(purrr)
df1 %>% 
    split(.$id) %>% 
    map_df(~mutate(., N = sum(df2$time >start_time & df2$time < start_time + 15))) %>% 
    select(-start_time)
#   id  N
#1  A 17
#2  B 18
#3  C 10

Upvotes: 6

Related Questions