Reputation: 8335
I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can acheive this in a few different ways, but would like to know how to do this 'properly'.
Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.
set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
The best I've got so far with dplyr is the following (but this loses the identity variables):
df1 %>%
rowwise() %>%
do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
mutate(n=nrow(count))
output:
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 × 2
count n
<list> <int>
1 <data.frame [17 × 1]> 17
2 <data.frame [18 × 1]> 18
3 <data.frame [10 × 1]> 10
I was expecting to be able to do this:
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>%
group_by(id) %>%
mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))
but this returns the error:
Error: comparison (6) is possible only for atomic and list types
What is the dplyr way of doing this?
Upvotes: 9
Views: 10664
Reputation: 7435
Another slightly different approach using dplyr
:
result <- df1 %>% group_by(id) %>%
summarise(count = length(which(df2$time > start_time &
df2$time < (start_time+15))))
print(result)
### A tibble: 3 x 2
## id count
## <fctr> <int>
##1 A 17
##2 B 18
##3 C 10
I believe you can use length
and which
to count the number of occurrences for which your condition is true for each id
in df1
. Then, group by id
and use this to summarise
.
If there are possibly more that one start_time
per id
, then you can use the same function but rowwise
and with mutate
:
result <- df1 %>% rowwise() %>%
mutate(count = length(which(df2$time > start_time &
df2$time < (start_time+15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
## id start_time count
## <fctr> <dbl> <int>
##1 A 10 17
##2 B 20 18
##3 C 30 10
Upvotes: 2
Reputation: 887128
Here is one option with data.table
where we can use the non-equi
joins
library(data.table)#1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew),
by = .EACHI][, c('id', 'N'), with = FALSE]
# id N
#1: A 17
#2: B 18
#3: C 10
which gives the same count as in the OP's base R
method
sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10
If we need the 'id' variable also as output in dplyr
, we can modify the OP's code
df1 %>%
rowwise() %>%
do(data.frame(., count = filter(df2, time>.$start_time,
time < (.$start_time + 15)))) %>%
group_by(id) %>%
summarise(n = n())
# id n
# <fctr> <int>
#1 A 17
#2 B 18
#3 C 10
Or another option is map
from purrr
with dplyr
library(purrr)
df1 %>%
split(.$id) %>%
map_df(~mutate(., N = sum(df2$time >start_time & df2$time < start_time + 15))) %>%
select(-start_time)
# id N
#1 A 17
#2 B 18
#3 C 10
Upvotes: 6