andschar
andschar

Reputation: 3973

Multiple first and last non-NA values by group

I have the following data.table:

require(data.table)
dt = data.table(
  id = c(rep('Grp 1', 31), rep('Grp 2', 31)),
  date = rep(as.IDate(as.IDate('2020-01-01') : as.IDate('2020-01-31')), 2),
  change = c(rep(NA, 5), rep('yes', 5), rep(NA, 10), rep('yes', 3), rep(NA, 8),
             rep(NA, 2), rep('yes', 8), rep(NA, 8), rep('yes', 5), rep(NA, 8))
)

For every group id I'd like to filter the first and last dates of a series, which is defined by a second column change being yes (i.e. non-NA). I can do the following, which would provide me with the first and last non-NA row by group. However, the problem is that the series occurs more than once per group.

dt[ !is.na(change),
    .(head(date, 1),
      tail(date, 1)),
    .(id) ]

These are the row indices I'd like to have filtered:

dt[c(6,10,21,23,34,41,50,54)]

Upvotes: 0

Views: 94

Answers (2)

ekoam
ekoam

Reputation: 8844

One way is to give a unique group id to each streak identified by an id and change combination. We can use rleid to generate such run-length type ids. Consider something like this

dt[,
   gid := rleid(id, change)
][!is.na(change),
  as.list(range(date)),
  by = .(id, gid)
][, 
  gid := NULL
]

Note that I also assume that you want the range of dates, not really the first and last elements. Your method will fail if the dates are not in chronological order. Output looks like this

      id         V1         V2
1: Grp 1 2020-01-06 2020-01-10
2: Grp 1 2020-01-21 2020-01-23
3: Grp 2 2020-01-03 2020-01-10
4: Grp 2 2020-01-19 2020-01-23

rleid works like this

> rleid(c(1, 1, 2, 3, 3), c("a", "b", "b", "d", "d"))
[1] 1 2 3 4 4

Upvotes: 1

akrun
akrun

Reputation: 886938

Here is an option with dplyr

library(dplyr)
library(data.table)
dt %>%
     group_by(grp = rleid(id, change), id) %>% 
     filter(!is.na(change)) %>%
     summarise(V1 = min(date, na.rm = TRUE),
               V2 = max(date, na.rm = TRUE), .groups = 'drop') 

-output

# A tibble: 4 x 4
#    grp id    V1         V2        
#  <int> <chr> <date>     <date>    
#1     2 Grp 1 2020-01-06 2020-01-10
#2     4 Grp 1 2020-01-21 2020-01-23
#3     7 Grp 2 2020-01-03 2020-01-10
#4     9 Grp 2 2020-01-19 2020-01-23

Upvotes: 1

Related Questions