Stephen Condor
Stephen Condor

Reputation: 241

Combine rows with consecutive dates into single row with start and end dates

I have a dataframe of events that looks something like this:

EVENT     DATE       LONG    LAT    TYPE     
1         1/1/2000   23      45     A
2         2/1/2000   23      45     B
3         3/1/2000   23      45     B
3         5/2/2000   22      56     A
4         6/2/2000   19      21     A

I'd like to collapse this so that any events that occur on consecutive days at the same location (as defined by LONG, LAT) are collapsed into a single event with a START and END date and a concatenated column of the TYPES involved.

Thus the above table would become:

EVENT     START-DATE    END-DATE    LONG    LAT    TYPE     
1         1/1/2000      3/1/2000    23      45     ABB
2         5/2/2000      5/2/2000    22      56     A
3         6/2/2000      6/2/2000    19      21     A

Any advice on how to best approach this would be greatly appreciated.

Upvotes: 3

Views: 2295

Answers (2)

Z.Lin
Z.Lin

Reputation: 29075

Here's a modified version of Ronak Shah's solution, taking non-consecutive events at the same location as separate event periods.

# expanded data sample
df <- data.frame(
  DATE = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-05",
                   "2000-02-05", "2000-02-06", "2000-02-07"), format = "%Y-%m-%d"),
  LONG = c(23, 23, 23, 23, 22, 19, 22),
  LAT = c(45, 45, 45, 45, 56, 21, 56),
  TYPE = c("A", "B", "B", "A", "A", "B", "A")
)

library(dplyr)

df %>%
  group_by(LONG, LAT) %>%
  arrange(DATE) %>%
  mutate(DATE.diff = c(1, diff(DATE))) %>%
  mutate(PERIOD = cumsum(DATE.diff != 1)) %>%
  ungroup() %>%
  group_by(LONG, LAT, PERIOD) %>%
  summarise(START_DATE = min(DATE),
            END_DATe = max(DATE), 
            TYPE = paste(TYPE, collapse = "")) %>%
  ungroup()

# A tibble: 5 x 6
   LONG   LAT PERIOD START_DATE   END_DATe  TYPE
  <dbl> <dbl>  <int>     <date>     <date> <chr>
1    19    21      0 2000-02-06 2000-02-06     B
2    22    56      0 2000-02-05 2000-02-05     A
3    22    56      1 2000-02-07 2000-02-07     A
4    23    45      0 2000-01-01 2000-01-03   ABB
5    23    45      1 2000-01-05 2000-01-05     A

Edit to add explanation for what's going on with the "PERIOD" variable.

For simplicity, let's consider some sequential consecutive & non-consecutive events at the same location, so we can skip the group_by(LONG, LAT) & arrange(DATE) steps:

# sample dataset of 10 events at the same location. 
# first 3 are on consecutive days, next 2 are on consecutive days,
# next 4 are on consecutive days, & last 1 is on its own.
df2 <- data.frame(
  DATE = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03", 
                   "2001-01-05", "2001-01-06",
                   "2001-02-01", "2001-02-02", "2001-02-03", "2001-02-04",
                   "2001-04-01"), format = "%Y-%m-%d"),
  LONG = rep(23, 10),
  LAT = rep(45, 10),
  TYPE = LETTERS[1:10]
)

As an intermediate step, we create some helper variables:

  1. "DATE.diff" counts the difference between current row's date & previous row's date. Since the first row has no date before "2001-01-01", we default the difference to 1.

  2. "non.consecutive" indicates whether the calculated date difference is not 1 (i.e. not consecutive from previous day), or 1 (i.e. consecutive from previous day). If you need to account for same-day events at the same location in the dataset, you can change the calculation from DATE.diff != 1 to DATE.diff > 1 here.

  3. "PERIOD" keeps track of the number of TRUE results in the "non.consecutive" variable. Starting from the first row, every time a row's is non-consecutive from the previous row, "PERIOD" increments by 1.

As a result of the helper variables, "PERIOD" takes on a different value for each group of consecutive dates.

df2.intermediate <- df2 %>%
  mutate(DATE.diff = c(1, diff(DATE))) %>%
  mutate(non.consecutive = DATE.diff != 1) %>%
  mutate(PERIOD = cumsum(non.consecutive))

> df2.intermediate
         DATE LONG LAT TYPE DATE.diff non.consecutive PERIOD
1  2001-01-01   23  45    A         1           FALSE      0
2  2001-01-02   23  45    B         1           FALSE      0
3  2001-01-03   23  45    C         1           FALSE      0
4  2001-01-05   23  45    D         2            TRUE      1
5  2001-01-06   23  45    E         1           FALSE      1
6  2001-02-01   23  45    F        26            TRUE      2
7  2001-02-02   23  45    G         1           FALSE      2
8  2001-02-03   23  45    H         1           FALSE      2
9  2001-02-04   23  45    I         1           FALSE      2
10 2001-04-01   23  45    J        56            TRUE      3

We can then treat "PERIOD" as a grouping variable in order to find the start / end date & events within each period:

df2.intermediate %>%
  group_by(PERIOD) %>%
  summarise(START_DATE = min(DATE),
            END_DATe = max(DATE), 
            TYPE = paste(TYPE, collapse = "")) %>%
  ungroup()

# A tibble: 4 x 4
  PERIOD START_DATE   END_DATe  TYPE
   <int>     <date>     <date> <chr>
1      0 2001-01-01 2001-01-03   ABC
2      1 2001-01-05 2001-01-06    DE
3      2 2001-02-01 2001-02-04  FGHI
4      3 2001-04-01 2001-04-01     J

Upvotes: 7

Ronak Shah
Ronak Shah

Reputation: 388817

With dplyr, we can group by LAT and LONG and select the maximum and minimum DATE for each group and paste the TYPE column together.

library(dplyr)
df %>%
   group_by(LONG, LAT) %>%
   summarise(start_date = min(as.Date(DATE, "%d/%m/%Y")), 
             end_date = max(as.Date(DATE, "%d/%m/%Y")), 
             type = paste0(TYPE, collapse = ""))



#   LONG   LAT start_date   end_date  type
#  <int> <int>     <date>     <date> <chr>
#1    19    21 2000-02-06 2000-02-06     A
#2    22    56 2000-02-05 2000-02-05     A
#3    23    45 2000-01-01 2000-01-03   ABB

Upvotes: 3

Related Questions