I have a longitudinal data.table of matches between different A and B, with a flow of payments between them:
A year B payment start_global end_global
1: 51557094 2002 65122111 80.39000 TRUE FALSE
2: 51557094 2003 65122111 9.74000 FALSE FALSE
3: 51557094 2004 65122111 7.85000 FALSE FALSE
4: 51557094 2005 65122111 97.16000 FALSE FALSE
5: 51557094 2006 65122111 48.22000 FALSE FALSE
6: 51557094 2007 65122111 91.24000 FALSE FALSE
7: 51557094 2008 65122111 9.35000 FALSE FALSE
8: 51557094 2009 65122111 13.15000 FALSE FALSE
9: 51557094 2010 65122111 3.46000 FALSE TRUE
10: 51557133 1998 65142845 60.43981 TRUE FALSE
11: 51557133 1999 65142845 111.60000 FALSE TRUE
12: 51557133 1997 65224333 21.03455 TRUE TRUE
13: 51557133 2000 65224333 144.17000 TRUE FALSE
14: 51557133 2001 65224333 102.52000 FALSE FALSE
15: 51557133 2002 65224333 5.79000 FALSE FALSE
16: 51557133 2003 65224333 8.48000 FALSE FALSE
17: 51557133 2004 65224333 68.16000 FALSE FALSE
18: 51557133 2005 65224333 29.36000 FALSE TRUE
I have already added the indicators start_global and end_global, which mark where a match starts and where it ends (based on whether there is a connection for the specific A-B link in the previous and next year).
I now need to compute the match length and the average payment for each A-B link. That is, my expected output would be something like:
A B payment start end
1 51557094 65122111 40.06222 2002 2010
In pandas, I would just do a simple groupby and do the calculations there. How would I proceed in R?
Note that there could be several matches between the same A and B in my data, which were terminated in between. If that's the case, I want a separate match for each (start_global, end_global) pair.
That is, say I had the following data:
13: 51557133 2000 65224333 144.17000 TRUE FALSE
14: 51557133 2001 65224333 102.52000 FALSE TRUE
16: 51557133 2003 65224333 8.48000 TRUE FALSE
17: 51557133 2004 65224333 68.16000 FALSE FALSE
18: 51557133 2005 65224333 29.36000 FALSE TRUE
I would want this to become
A B payment start end
1 51557133 65224333 123.34500 2000 2001
2 51557133 65224333 35.33333 2003 2005
and not
A B payment start end
1 51557133 65224333 70.538 2000 2005
I'm going to be using this on a secured server where installing additional packages is very cumbersome and all but impossible. I already have plyr and data.table on that server; if there's a way to do this without installing additional packages, I'd prefer that by a huge margin.
For completeness, here's the list of packages allowed:
MASS devtools gtable munsell reshape2
RColorBrewer dichromat haven packrat rstudio
Rcpp digest labeling plyr scales
colorspace foreign mFilter proto stringr
data.table ggplot2 manipulate reshape yaml
Upvotes: 1
Views: 53
Using base R:
# Spell counter: cumsum() over the start flags numbers each A-B spell
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)

# Aggregate per spell; [, -3] drops the helper column from the first block,
# [, 4] picks the aggregated year column from the other two
cbind(
  aggregate(payment ~ A + B + startCount, mean, data = df1)[, -3],
  start = aggregate(year ~ A + B + startCount, min, data = df1)[, 4],
  end   = aggregate(year ~ A + B + startCount, max, data = df1)[, 4]
)
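The key step in all three versions below is this running cumsum() over the start flags: every TRUE opens a new spell, so rows belonging to the same uninterrupted A-B match share a startCount value. A minimal illustration (not part of the answer's code):

# Each TRUE in the start flags bumps the counter, giving one id per spell
cumsum(c(TRUE, FALSE, TRUE, FALSE, FALSE))
# [1] 1 1 2 2 2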
Using the dplyr package:
library(dplyr)
df1 %>%
  group_by(A, B) %>%
  mutate(startCount = cumsum(ifelse(start_global == TRUE, 1, 0))) %>%
  group_by(A, B, startCount) %>%
  summarise(
    payment = mean(payment),
    start   = min(year),
    end     = max(year)
  ) %>%
  select(-startCount)
Using data.table:
library(data.table)
# Same spell counter as above (ave() from base R also works on a data.table column)
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)

# Aggregate by A, B and spell, then drop the helper column by reference
result <- df1[, j = list(payment = mean(payment), start = min(year), end = max(year)),
              by = list(A, B, startCount)]
result[, startCount := NULL]
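Since only base R, plyr and data.table are available on the server, this data.table route already fits the constraint. For completeness, a sketch of the same idea kept entirely in data.table syntax (assuming df1 is the data.table given at the bottom of this answer):

library(data.table)

# Build the spell counter by reference within each A-B group, then aggregate per spell
df1[, startCount := cumsum(start_global), by = list(A, B)]
result <- df1[, list(payment = mean(payment), start = min(year), end = max(year)),
              by = list(A, B, startCount)]
result[, startCount := NULL]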
Output:
Source: local data table [4 x 5]
Groups: A, B
A B payment start end
1 51557094 65122111 40.06222 2002 2010
2 51557133 65142845 86.01990 1998 1999
3 51557133 65224333 21.03455 1997 1997
4 51557133 65224333 59.74667 2000 2005
Benchmarking
data.table is the fastest by far:
Unit: milliseconds
expr min lq mean median uq max neval
BASE 5.808398 22.212135 32.391813 26.293450 34.08702 325.40491 1000
DPLYR 4.352663 17.011435 25.892872 20.931953 27.37157 177.39900 1000
DATATABLE 1.067853 4.139477 6.326194 4.987943 6.75672 85.24855 1000
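The timing table above has the print format of the microbenchmark package (which is not on the OP's allowed list). A sketch of how such a comparison could be run, assuming the three approaches are wrapped in hypothetical helper functions baseWay(), dplyrWay() and dtWay():

library(microbenchmark)

# baseWay(), dplyrWay() and dtWay() are hypothetical wrappers around the three
# snippets above; times = 1000 matches the neval column in the table
microbenchmark(
  BASE      = baseWay(df1),
  DPLYR     = dplyrWay(df1),
  DATATABLE = dtWay(df1),
  times = 1000
)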
Data used:
df1 <-
structure(list(A = c(51557094L, 51557094L, 51557094L, 51557094L,
51557094L, 51557094L, 51557094L, 51557094L, 51557094L, 51557133L,
51557133L, 51557133L, 51557133L, 51557133L, 51557133L, 51557133L,
51557133L, 51557133L), year = c(2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2008L, 2009L, 2010L, 1998L, 1999L, 1997L, 2000L, 2001L,
2002L, 2003L, 2004L, 2005L), B = c(65122111L, 65122111L, 65122111L,
65122111L, 65122111L, 65122111L, 65122111L, 65122111L, 65122111L,
65142845L, 65142845L, 65224333L, 65224333L, 65224333L, 65224333L,
65224333L, 65224333L, 65224333L), payment = c(80.39, 9.74, 7.85,
97.16, 48.22, 91.24, 9.35, 13.15, 3.46, 60.43981, 111.6, 21.03455,
144.17, 102.52, 5.79, 8.48, 68.16, 29.36), start_global = c(TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), end_global = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)), .Names = c("A",
"year", "B", "payment", "start_global", "end_global"), class = c("data.table",
"data.frame"), row.names = c(NA, -18L))
Upvotes: 2