FooBar

Reputation: 16478

Combine annual data to matches

I have a longitudinal data.table of matches between different A and B, together with a flow of payments between them.

           A year        B   payment start_global end_global
 1: 51557094 2002 65122111  80.39000         TRUE      FALSE
 2: 51557094 2003 65122111   9.74000        FALSE      FALSE
 3: 51557094 2004 65122111   7.85000        FALSE      FALSE
 4: 51557094 2005 65122111  97.16000        FALSE      FALSE
 5: 51557094 2006 65122111  48.22000        FALSE      FALSE
 6: 51557094 2007 65122111  91.24000        FALSE      FALSE
 7: 51557094 2008 65122111   9.35000        FALSE      FALSE
 8: 51557094 2009 65122111  13.15000        FALSE      FALSE
 9: 51557094 2010 65122111   3.46000        FALSE       TRUE
10: 51557133 1998 65142845  60.43981         TRUE      FALSE
11: 51557133 1999 65142845 111.60000        FALSE       TRUE
12: 51557133 1997 65224333  21.03455         TRUE       TRUE
13: 51557133 2000 65224333 144.17000         TRUE      FALSE
14: 51557133 2001 65224333 102.52000        FALSE      FALSE
15: 51557133 2002 65224333   5.79000        FALSE      FALSE
16: 51557133 2003 65224333   8.48000        FALSE      FALSE
17: 51557133 2004 65224333  68.16000        FALSE      FALSE
18: 51557133 2005 65224333  29.36000        FALSE       TRUE

I have already added the indicators start_global and end_global, which mark where a match starts and where it ends (based on whether there is a connection for the specific A-B link in the previous and the following year).
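
For reference, one way such indicators could be constructed with data.table; this is just a sketch, assuming the data is stored as df1 (the name used in the answer below), that sorting by A, B, year is acceptable, and that a match is ongoing exactly when the same A-B link appears in consecutive years:

library(data.table)
setDT(df1)
setorder(df1, A, B, year)
# a match starts where no row exists for the previous year,
# and ends where no row exists for the following year
df1[, `:=`(
  start_global = is.na(shift(year)) | shift(year) != year - 1L,
  end_global   = is.na(shift(year, type = "lead")) | shift(year, type = "lead") != year + 1L
), by = .(A, B)]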

I now need to compute match length and average payments for each A-B link. That is, my expected output would be something like

         A        B  payment start  end
1 51557094 65122111 40.06222  2002 2010 

In pandas, I would just do a simple groupby and do the calculations there. How would I proceed in R?

Note: Actual Matches, not just A-B combinations

Note that there could be several matches between the same A and B in my data that were terminated in between. If that's the case, I want a separate match for each (start_global, end_global) pair.

That is, say I had the following data:

13: 51557133 2000 65224333 144.17000         TRUE      FALSE
14: 51557133 2001 65224333 102.52000        FALSE       TRUE
16: 51557133 2003 65224333   8.48000         TRUE      FALSE
17: 51557133 2004 65224333  68.16000        FALSE      FALSE
18: 51557133 2005 65224333  29.36000        FALSE       TRUE

I would want this to become

         A        B   payment start  end
1 51557133 65224333 123.34500  2000 2001
2 51557133 65224333  35.33333  2003 2005

and not

         A        B payment start  end
1 51557133 65224333  70.538  2000 2005

Note: No dplyr

I'm going to be using this on a secured server where installing additional packages is very cumbersome and close to impossible. I already have plyr and data.table on that server; if there's a way to do this without installing additional packages, I'd prefer that by a huge margin.

For completeness, here's the list of packages allowed:

MASS        devtools    gtable      munsell     reshape2
RColorBrewer    dichromat   haven       packrat     rstudio
Rcpp        digest      labeling    plyr        scales
colorspace  foreign     mFilter     proto       stringr
data.table  ggplot2     manipulate  reshape     yaml

Upvotes: 1

Views: 53

Answers (1)

Sam Firke

Reputation: 23014

Using base R:

# startCount: cumulative number of match starts per A-B pair,
# so every run between two starts gets its own id
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)
# per match: mean payment plus first and last year
cbind(
  aggregate(payment ~ A + B + startCount, mean, data = df1)[, -3],
  start = aggregate(year ~ A + B + startCount, min, data = df1)[, 4],
  end = aggregate(year ~ A + B + startCount, max, data = df1)[, 4]
)
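
The cbind() works because all three aggregate() calls group on the same variables and therefore return their rows in the same order. If you'd rather not rely on that, a merge-based sketch joins the pieces explicitly on the grouping columns (the object names here are only for illustration):

pay <- aggregate(payment ~ A + B + startCount, data = df1, FUN = mean)
lo  <- aggregate(year ~ A + B + startCount, data = df1, FUN = min)
hi  <- aggregate(year ~ A + B + startCount, data = df1, FUN = max)
names(lo)[4] <- "start"; names(hi)[4] <- "end"
out <- Reduce(function(x, y) merge(x, y, by = c("A", "B", "startCount")),
              list(pay, lo, hi))
out$startCount <- NULL
out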

Using the dplyr package:

library(dplyr)
df1 %>%
  group_by(A, B) %>%
  mutate(startCount = cumsum(ifelse(start_global==TRUE,1,0))) %>%
  group_by(A, B, startCount) %>%
  summarise(
    payment = mean(payment),
    start = min(year),
    end = max(year)
    ) %>%
  select(-startCount)

Using data.table:

library(data.table)
# startCount: cumulative number of match starts per A-B pair
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)
# aggregate per match, then drop the helper column
result <- df1[, j = list(payment = mean(payment), start = min(year), end = max(year)), by = list(A, B, startCount)]
result[, startCount := NULL]
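
Since data.table is available on the server, the helper column can also be built by reference, avoiding ave() entirely; a sketch:

library(data.table)
setDT(df1)
# run id: increments at each new match start within an A-B pair
df1[, startCount := cumsum(start_global), by = .(A, B)]
result <- df1[, .(payment = mean(payment), start = min(year), end = max(year)),
              by = .(A, B, startCount)][, startCount := NULL][]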

Output:

Source: local data table [4 x 5]
Groups: A, B

         A        B  payment start  end
1 51557094 65122111 40.06222  2002 2010
2 51557133 65142845 86.01990  1998 1999
3 51557133 65224333 21.03455  1997 1997
4 51557133 65224333 59.74667  2000 2005

Benchmarking shows data.table is the fastest by far:

Unit: milliseconds
      expr      min        lq      mean    median       uq       max neval
      BASE 5.808398 22.212135 32.391813 26.293450 34.08702 325.40491  1000
     DPLYR 4.352663 17.011435 25.892872 20.931953 27.37157 177.39900  1000
 DATATABLE 1.067853  4.139477  6.326194  4.987943  6.75672  85.24855  1000
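
The benchmark code itself is not shown above; a minimal sketch of how a comparison in this format could be run, assuming the microbenchmark package (not on the question's allowed list, so only for reproducing the timings elsewhere) and the df1 object below. The run counter is computed once up front here, so the numbers will not match the table above exactly:

library(microbenchmark)
library(data.table)
library(dplyr)

# helper column computed once, outside the timed expressions
df1$startCount <- ave(df1$start_global, df1$A, df1$B, FUN = cumsum)

microbenchmark(
  BASE = cbind(
    aggregate(payment ~ A + B + startCount, mean, data = df1)[, -3],
    start = aggregate(year ~ A + B + startCount, min, data = df1)[, 4],
    end = aggregate(year ~ A + B + startCount, max, data = df1)[, 4]
  ),
  DPLYR = df1 %>%
    group_by(A, B, startCount) %>%
    summarise(payment = mean(payment), start = min(year), end = max(year)) %>%
    select(-startCount),
  DATATABLE = df1[, .(payment = mean(payment), start = min(year), end = max(year)),
                  by = .(A, B, startCount)][, startCount := NULL],
  times = 1000
)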

Data used:

df1 <- 
structure(list(A = c(51557094L, 51557094L, 51557094L, 51557094L, 
51557094L, 51557094L, 51557094L, 51557094L, 51557094L, 51557133L, 
51557133L, 51557133L, 51557133L, 51557133L, 51557133L, 51557133L, 
51557133L, 51557133L), year = c(2002L, 2003L, 2004L, 2005L, 2006L, 
2007L, 2008L, 2009L, 2010L, 1998L, 1999L, 1997L, 2000L, 2001L, 
2002L, 2003L, 2004L, 2005L), B = c(65122111L, 65122111L, 65122111L, 
65122111L, 65122111L, 65122111L, 65122111L, 65122111L, 65122111L, 
65142845L, 65142845L, 65224333L, 65224333L, 65224333L, 65224333L, 
65224333L, 65224333L, 65224333L), payment = c(80.39, 9.74, 7.85, 
97.16, 48.22, 91.24, 9.35, 13.15, 3.46, 60.43981, 111.6, 21.03455, 
144.17, 102.52, 5.79, 8.48, 68.16, 29.36), start_global = c(TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, 
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), end_global = c(FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)), .Names = c("A", 
"year", "B", "payment", "start_global", "end_global"), class = c("data.table", 
"data.frame"), row.names = c(NA, -18L))

Upvotes: 2
