Magasinus
Magasinus

Reputation: 83

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range? Specifically, I want to count the number of different id by country over the last 8 months, but only if id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care off, accounting for this is not necessary but still helpful.

My current best try is, but does not account for the restriction (ID must appear at least twice in the previous 8 months) and also I find its counting odd when looking at the dates="2017-12-12", where desired_unrestricted should be equal to 4 according to my counting but the code gives 2.

dt[, date := as.Date(date)][
  , totalids := sapply(date, 
   function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))), 
   by = country]

Data

library(data.table)
library(lubridate)

ID    <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)

dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)

Thanks in advance.

Upvotes: 1

Views: 694

Answers (3)

r2evans
r2evans

Reputation: 160407

This data.table-only answer is motivated by a comment,

dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
  ][, results := dt[dt, on = .(country, date > date8, date <= date), 
                    length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
  ][, date8 := NULL ]
#         id       date value country desired_output desired_unrestricted results
#     <char>     <Date> <num>  <char>          <num>                <num>   <int>
#  1:      1 2017-01-01     2      UK              1                    2       1
#  2:      1 2017-01-01     4      UK              1                    2       1
#  3:      1 2017-01-05     3      US              0                    1       0
#  4:      1 2017-05-01     5      US              1                    3       2
#  5:      1 2017-05-01     2    <NA>             NA                   NA       0
#  6:      1 2018-05-02     5      US              0                    1       0
#  7:      2 2017-01-01     8      UK              1                    2       1
#  8:      2 2017-01-05    17      UK              2                    2       2
#  9:      2 2017-05-01    17      US              1                    3       2
# 10:      3 2017-05-01     3      US              2                    3       2
# 11:      3 2017-05-01     7      US              2                    3       2
# 12:   <NA> 2017-12-12     5      US              2                    4       1
# 13:      4 2017-12-12     3      US              2                    4       1

That's a lot to absorb.

Quick walk-through:

  • "8 months ago":

    seq(z, length=2, by="-8 months")[2]
    

    seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.

  • Date subtraction:

    [, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
    

    A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)

    We need that in dt first since we do a non-equi (range-based) join based on this value.

  • Counting id with 2 or more visits:

    length(Filter(function(z) z > 1, table(id)))
    

    We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) allows us to reduce those that have a count below 2, and we're left with a named-vector of ids that had 2 or more visits. Retrieving the length is what we need.

  • Self non-equi join:

    dt[dt, on = .(country, date > date8, date <= date), ... ]
    

    Relatively straight-forward. This is an open/closed ranging, it can be changed to both-closed if you prefer.

  • Self non-equi join but count ids by-row: by=.EACHI.

  • Retrieve the results of that and assign into the original dt:

    [, results := dt[...]$V1 ]
    

    Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)

  • Cleanup:

    [, date8 := NULL ]
    

    (Nothing fancy here, just proper data-stewardship :-)

There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...

Upvotes: 3

TimTeaFan
TimTeaFan

Reputation: 18541

Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: The number of ìds in the last 8 months which have a count of at least greater than 2.

library(dplyr)
library(lubridate)

dt <- dt %>% mutate(date = as.Date(date))

dt %>% 
  group_by(country) %>% 
  group_modify(~ .x %>% 
  rowwise() %>% 
  mutate(totalids = .x %>%
           filter(date <= .env$date, date >= .env$date %m-% months(8)) %>% 
           pull(id) %>% 
           table() %>% 
           `[`(. >1) %>% 
           length
  )) 

#> # A tibble: 13 x 7
#> # Groups:   country [3]
#>    country id    date       value desired_output desired_unrestricted totalids
#>    <chr>   <chr> <date>     <dbl>          <dbl>                <dbl>    <int>
#>  1 UK      1     2017-01-01     2              1                    2        1
#>  2 UK      1     2017-01-01     4              1                    2        1
#>  3 UK      2     2017-01-01     8              1                    2        1
#>  4 UK      2     2017-01-05    17              2                    2        2
#>  5 US      1     2017-01-05     3              0                    1        0
#>  6 US      1     2017-05-01     5              1                    3        2
#>  7 US      1     2018-05-02     5              0                    1        0
#>  8 US      2     2017-05-01    17              1                    3        2
#>  9 US      3     2017-05-01     3              2                    3        2
#> 10 US      3     2017-05-01     7              2                    3        2
#> 11 US      <NA>  2017-12-12     5              2                    4        1
#> 12 US      4     2017-12-12     3              2                    4        1
#> 13 <NA>    1     2017-05-01     2             NA                   NA        0

Created on 2021-09-02 by the reprex package (v2.0.1)

Upvotes: 2

chinsoon12
chinsoon12

Reputation: 25225

Here is another option:

setkey(dt, country, date, id)
dt[, date := as.IDate(date)][, 
    eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]

dt[, c("out", "out_unres") := 
    dt[dt, on=.(country, date>=eightmthsago, date<=date), 
        by=.EACHI, {
                v <- id[!is.na(id)]
                .(uniqueN(v[duplicated(v)]), uniqueN(v))
            }][,1L:3L := NULL]
]
dt

output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):

      id       date value country desired_output desired_unrestricted eightmthsago out out_unres
 1:    1 2017-05-01     2    <NA>             NA                   NA   2016-09-01   0         1
 2:    1 2017-01-01     2      UK              1                    2   2016-05-01   1         2
 3:    1 2017-01-01     4      UK              1                    2   2016-05-01   1         2
 4:    2 2017-01-01     8      UK              1                    2   2016-05-01   1         2
 5:    2 2017-01-05    17      UK              2                    2   2016-05-05   2         2
 6:    1 2017-01-05     3      US              0                    1   2016-05-05   0         1
 7:    1 2017-05-01     5      US              1                    3   2016-09-01   2         3
 8:    2 2017-05-01    17      US              1                    3   2016-09-01   2         3
 9:    3 2017-05-01     3      US              2                    3   2016-09-01   2         3
10:    3 2017-05-01     7      US              2                    3   2016-09-01   2         3
11: <NA> 2017-12-12     5      US              2                    4   2017-04-12   1         4
12:    4 2017-12-12     3      US              2                    4   2017-04-12   1         4
13:    1 2018-05-02     5      US              0                    1   2017-09-02   0         2

Upvotes: 2

Related Questions