R data table interval calculation by column

Question

I need to count unique IDs within intervals of different lengths (3, 4, 5, 6 months, ...) for each month. I need to do that for different groups as well, such as age, gender, etc. This is what my data looks like:

ID  Yr_month  Age  Gender
11  2012-01   30   M
11  2012-02   30   M
...
11  2012-12   30   M
12  2012-01   32   F...

The output should look like this:

Yr_month cnt_distinctID_3 count_distinctID_4....
2012-01   300             400

I am able to do this using multiple for loops and dplyr. Is there a faster way to get this done using data.table? Thanks!

This is what my code looks like:

library(data.table)
library(dplyr)

# flag the first occurrence of each id
setorderv(test, c("id", "year_mth"))
setkeyv(test, c("id"))
test <- data.table(cbind(test, first = 0L))
test[test[unique(test), , mult = "first", which = TRUE], first := 1L]

# number of first-time ids per month
test1 <- test %>%
  group_by(year_mth) %>%
  summarize(first_total = sum(first)) %>%
  select(year_mth, first_total)

# cumulative number of first-time ids up to each month
test2 <- test1 %>%
  arrange(year_mth) %>%
  mutate(Cusum = cumsum(first_total)) %>%
  select(year_mth, Cusum)

Then I run a for loop over year_mth and K <- seq(3:36) on the above. It's taking a lot of time because the dataset is big.

Uwe · Accepted Answer

If I understand the question correctly, the OP wants to count unique IDs in rolling windows of varying sizes. The counts are to be presented in a table where the length of the rolling window runs horizontally and the ending month of the rolling window vertically.

This approach creates all intervals as a data.table and aggregates during a non-equi join with the dataset. Finally, the results are reshaped from long to wide format.

Creating a sample dataset

The OP has not provided a sample dataset. So, we have to make up our own:

library(data.table)

# create year-month sequence
yr_m <- CJ(2012:2014, 1:12)[, sprintf("%4i-%02i", V1, V2)]
n_id <- 100L   # number of individual IDs
n_row <- 1e3L  # number of rows to create
set.seed(123L)   # required for reproducible results
DT <- data.table(ID = sample.int(n_id, n_row, TRUE),
                 Yr_month = ordered(sample(yr_m, n_row, TRUE), yr_m))
str(DT)

Classes ‘data.table’ and 'data.frame':    1000 obs. of  2 variables:
 $ ID      : int  29 79 41 89 95 5 53 90 56 46 ...
 $ Yr_month: Ord.factor w/ 36 levels "2012-01"<"2012-02"<..: 10 22 6 31 31 18 28 11 3 16 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Note that Yr_month is created as an ordered factor. This is required for the subsequent non-equi join, which involves comparison operations.
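
As a small illustration of the comparison operations in question, the following sketch uses made-up values (lv, x and in_range are names chosen for this example only, not part of the sample dataset) to show that an ordered factor can be compared against its own level labels:

```r
# three consecutive months serve as the ordered levels
lv <- c("2012-01", "2012-02", "2012-03")
x <- ordered(c("2012-03", "2012-01", "2012-02"), levels = lv)

# the comparison follows the level order, not the order of appearance
in_range <- x >= "2012-02" & x <= "2012-03"
in_range
```

Here in_range is TRUE, FALSE, TRUE: only the "2012-01" entry falls outside the range.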

Create intervals

intervals <- rbindlist(
  lapply(3:24, function(x) data.table(K = x, 
                                      start = head(yr_m, -(x - 1L)), 
                                      end = tail(yr_m, -(x - 1L)))
  ))

For illustration, only intervals of 3 to 24 months length are considered here.

intervals

      K   start     end
  1:  3 2012-01 2012-03
  2:  3 2012-02 2012-04
  3:  3 2012-03 2012-05
  4:  3 2012-04 2012-06
  5:  3 2012-05 2012-07
 ---                   
513: 24 2012-09 2014-08
514: 24 2012-10 2014-09
515: 24 2012-11 2014-10
516: 24 2012-12 2014-11
517: 24 2013-01 2014-12
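
To see how head() and tail() pair the start and end of each window, it may help to look at a much shorter sequence. This is only a sketch with six made-up months and a single window length; mths, x and w are names used for this illustration only:

```r
# six consecutive months and a window length of 3
mths <- sprintf("2012-%02i", 1:6)
x <- 3L

# dropping the last (x - 1) months gives the possible starts,
# dropping the first (x - 1) months gives the matching ends
w <- data.frame(start = head(mths, -(x - 1L)),
                end   = tail(mths, -(x - 1L)))
w
```

With six months and a length of 3 there are four windows, the first running from 2012-01 to 2012-03 and the last from 2012-04 to 2012-06.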

Aggregate during non-equi join and reshape

DT[intervals, on = .(Yr_month >= start, Yr_month <= end), 
   .(count = uniqueN(ID), end, K), by = .EACHI][
     , dcast(.SD, end ~ K, value.var = "count")]

        end  3  4  5  6  7  8  9 10 11 12 13 14  15  16  17  18  19  20  21  22  23  24
 1: 2012-03 59 NA NA NA NA NA NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 2: 2012-04 53 64 NA NA NA NA NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 3: 2012-05 59 69 80 NA NA NA NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 4: 2012-06 57 72 78 88 NA NA NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 5: 2012-07 53 62 75 80 89 NA NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 6: 2012-08 50 65 71 81 86 91 NA NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 7: 2012-09 58 65 71 76 84 89 93 NA NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 8: 2012-10 59 67 72 77 82 88 92 94 NA NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 9: 2012-11 57 66 72 77 82 86 91 94 96 NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
10: 2012-12 57 67 75 80 83 88 91 95 97 98 NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
11: 2013-01 53 63 71 78 83 85 90 93 97 98 99 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
12: 2013-02 57 68 77 82 87 91 92 95 97 97 98 99  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
13: 2013-03 56 67 75 83 86 88 92 93 96 97 97 98  99  NA  NA  NA  NA  NA  NA  NA  NA  NA
14: 2013-04 57 67 76 81 87 90 92 95 96 98 99 99 100 100  NA  NA  NA  NA  NA  NA  NA  NA
15: 2013-05 65 74 79 83 86 90 93 95 97 98 99 99  99 100 100  NA  NA  NA  NA  NA  NA  NA
16: 2013-06 71 77 83 85 87 89 92 95 97 98 99 99  99  99 100 100  NA  NA  NA  NA  NA  NA
17: 2013-07 65 78 83 88 90 91 91 94 96 97 98 99  99  99  99 100 100  NA  NA  NA  NA  NA
18: 2013-08 57 73 84 88 91 93 94 94 97 99 99 99 100 100 100 100 100 100  NA  NA  NA  NA
19: 2013-09 62 71 81 90 92 95 96 96 96 97 99 99  99 100 100 100 100 100 100  NA  NA  NA
20: 2013-10 62 71 79 87 93 95 98 98 98 98 98 99  99  99 100 100 100 100 100 100  NA  NA
21: 2013-11 61 74 81 87 91 95 96 99 99 99 99 99 100 100 100 100 100 100 100 100 100  NA
22: 2013-12 64 76 83 88 93 96 98 99 99 99 99 99  99 100 100 100 100 100 100 100 100 100
23: 2014-01 56 70 78 84 89 94 96 98 99 99 99 99  99  99 100 100 100 100 100 100 100 100
24: 2014-02 52 67 76 83 88 90 95 96 98 99 99 99  99  99  99 100 100 100 100 100 100 100
25: 2014-03 51 62 72 80 85 89 91 95 96 98 99 99  99  99  99  99 100 100 100 100 100 100
26: 2014-04 58 62 71 76 83 87 90 92 96 97 99 99  99  99  99  99  99 100 100 100 100 100
27: 2014-05 60 67 70 78 82 88 90 92 94 97 98 99  99  99  99  99  99  99 100 100 100 100
28: 2014-06 58 74 78 80 85 88 93 93 94 94 97 98  99  99  99  99  99  99  99 100 100 100
29: 2014-07 60 70 81 83 85 88 90 94 94 95 95 98  99 100 100 100 100 100 100 100 100 100
30: 2014-08 64 71 79 89 91 91 93 94 96 96 96 96  99  99 100 100 100 100 100 100 100 100
31: 2014-09 57 68 74 82 92 94 94 94 95 96 96 96  96  99  99 100 100 100 100 100 100 100
32: 2014-10 57 67 74 79 87 96 97 97 97 97 98 98  98  98 100 100 100 100 100 100 100 100
33: 2014-11 48 63 71 77 82 89 97 98 98 98 98 99  99  99  99 100 100 100 100 100 100 100
34: 2014-12 52 61 71 77 82 86 91 99 99 99 99 99  99  99  99  99 100 100 100 100 100 100
        end  3  4  5  6  7  8  9 10 11 12 13 14  15  16  17  18  19  20  21  22  23  24
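
The NA cells appear because a window of length K cannot end before the K-th month, so the long result has no row for that combination and dcast() fills the gap. A minimal sketch of this reshaping step, using only the first three counts from the table above (long and wide are names chosen for this example):

```r
library(data.table)

# long format: one row per (end month, window length) combination
long <- data.table(end   = c("2012-03", "2012-04", "2012-04"),
                   K     = c(3L, 3L, 4L),
                   count = c(59L, 53L, 64L))

# wide format: one row per end month, one column per window length
wide <- dcast(long, end ~ K, value.var = "count")
wide
```

There is no 4-month window ending in 2012-03, so column 4 is NA in the first row.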

uniqueN() is a data.table function which is used here to count the number of unique IDs.
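
To check that the join counts what is intended, the same pattern can be run on a toy dataset that is small enough to verify by hand. This is a sketch under assumptions: toy, win, months, res and brute are hypothetical names and values made up for this check, and the brute-force loop is only a cross-check, not part of the approach above.

```r
library(data.table)

months <- c("2012-01", "2012-02", "2012-03", "2012-04")
toy <- data.table(
  ID = c(1L, 2L, 1L, 3L, 3L),
  Yr_month = ordered(c("2012-01", "2012-01", "2012-02", "2012-03", "2012-04"),
                     levels = months))

# the two possible 3-month windows within four months
win <- data.table(K = 3L, start = months[1:2], end = months[3:4])

# same join and aggregation pattern as above
res <- toy[win, on = .(Yr_month >= start, Yr_month <= end),
           .(count = uniqueN(ID), end, K), by = .EACHI]

# brute-force cross-check: filter each window explicitly
brute <- unname(mapply(
  function(s, e) uniqueN(toy$ID[toy$Yr_month >= s & toy$Yr_month <= e]),
  win$start, win$end))

res$count
brute
```

The window ending in 2012-03 contains IDs 1, 2 and 3, while the window ending in 2012-04 contains only IDs 1 and 3, so both computations should give 3 and 2.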
