How to aggregate data based on consecutive row values?

Question

I am doing data analysis of photos of animals from trail cameras. My data includes what camera a picture was taken with, the date and time the picture was taken, and the animal in the photo. I wish to aggregate my data based on the time animals spent in front of the camera. For our purposes, an encounter is anytime we photograph an animal more than 10 minutes after photographing another of the same species. Encounters can be more than 10 minutes long in some cases, such as if we took 3 pictures of the same animal 7 minutes apart from one another, a 21 minute encounter. I want my output to aggregate my data into individual encounters for all animals photographed, and include start times and end times for each encounter photo series.

My code thus far

library(dplyr)

#Data 
df <- structure(list(camera_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L), date = c("11-May-21", "11-May-21", "11-May-21", 
"15-May-21", "15-May-21", "10-May-21", "10-May-21", "12-May-21", 
"12-May-21", "12-May-21", "12-May-21", "12-May-21", "13-May-21", 
"13-May-21"), time = c("5:23:46", "5:23:50", "5:32:34", "9:35:20", 
"9:35:35", "23:11:16", "23:11:17", "11:06:08", "11:15:09", "11:24:10", 
"2:04:01", "2:04:03", "1:15:00", "1:15:50"), organism = c("mouse", 
"mouse", "bird", "squirrel", "squirrel", "mouse", "mouse", "woodchuck", 
"woodchuck", "woodchuck", "mouse", "mouse", "mouse", "mouse")), class = "data.frame", row.names = c(NA, 
-14L))

#Combining date and time
df$datetime <- as.POSIXct(paste(df$date, df$time), format ="%d-%B-%y %H:%M:%S")

#Time differences in minutes, based on organism
df <- df %>% group_by(organism) %>%
  mutate(timediff = (datetime - lag(datetime))/60
  )

#Round minutes to 2 decimal points
df$timediff <- round(df$timediff, digits=2)

#Make negative and NA values = 0. Negative values appear when going from one camera to the next. R thinks it is going back in time, rather than 
#swapping cameras
df$timediff[df$timediff<0] <- 0
df$timediff[is.na(df$timediff)] <- 0

At this point, I want to use timediff as my condition for aggregation, and aggregate any subsequent rows of data with a timediff < 10, as long as the row has the same camera_id and organism. I've been trying different dplyr approaches but havent been able to crack this. The output should look like this.

structure(list(camera_id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), start_datetime = c("5/11/2021 5:23", 
"5/11/2021 5:32", "5/15/2021 9:35", "5/10/2021 23:11", "5/10/2021 11:06", 
"5/12/2021 2:04", "5/13/2021 1:15"), end_datetime = c("5/11/2021 5:23", 
"5/11/2021 5:32", "5/15/2021 9:35", "5/10/2021 23:11", "5/10/2021 11:24", 
"5/12/2021 2:04", "5/13/2021 1:15"), organism = c("mouse", "bird", 
"squirrel", "mouse", "woodchuck", "mouse", "mouse"), encounter_time = c("0:00:04", 
"0:00:00", "0:00:15", "0:00:01", "0:18:00", "0:00:02", "0:00:50"
)), class = "data.frame", row.names = c(NA, -7L))

Mako212 · Accepted Answer

I think this gets you your desired result:

A couple key changes: when we calculate timediff it makes sense to group by camera_id in addition to organism, since that grouping persists throughout.

Then, we need to create some helper columns to generate our grouping based on the 10 second condition.

under_10 is 0 for all values of timediff less than 10, and also when timediff is NA (when a row is the first within the group). Under 10 is 1 when timelapsed > 10.

Then we create a grouping variable that increments when time elapsed is > 10. Then we simply summarize, calculating start and end based on min/max datetimes, and remove the grouping column.

library(tidyverse)

df$datetime <- as.POSIXct(paste(df$date, df$time), format ="%d-%B-%y %H:%M:%S")

#Time differences in minutes, based on organism
df <- df %>% group_by(organism, camera_id) %>%
  mutate(timediff = (datetime - lag(datetime))/60
  )

#Round minutes to 2 decimal points
df$timediff <- round(df$timediff, digits=2)

df %>% mutate(under_10 = ifelse(timediff < 10 | is.na(timediff), 0, 1)) %>% 
  arrange(camera_id, datetime) %>%
  mutate(grouping = cumsum(under_10)) %>%
  group_by(camera_id, organism, grouping) %>% 
  summarize(start_datetime = min(datetime), end_datetime = max(datetime),
    encounter_time = end_datetime-start_datetime) %>%
  select(-grouping)



camera_id organism  start_datetime      end_datetime        encounter_time
                                                
1         1 bird      2021-05-11 05:32:34 2021-05-11 05:32:34    0 secs     
2         1 mouse     2021-05-11 05:23:46 2021-05-11 05:23:50    4 secs     
3         1 squirrel  2021-05-15 09:35:20 2021-05-15 09:35:35   15 secs     
4         2 mouse     2021-05-10 23:11:16 2021-05-10 23:11:17    1 secs     
5         2 woodchuck 2021-05-12 11:06:08 2021-05-12 11:24:10 1082 secs     
6         3 mouse     2021-05-12 02:04:01 2021-05-12 02:04:03    2 secs     
7         3 mouse     2021-05-13 01:15:00 2021-05-13 01:15:50   50 secs

Also, if you'd rather have the H:MM:SS format for encounter_time you can get here like this, adding the following after the summarize call in the above code:

library(lubridate)
  ...
  mutate(encounter_time = seconds_to_period(as.character(encounter_time))) %>%
  select(-grouping) %>%
  mutate(encounter_time = sprintf("%1i:%02i:%02i", 
    lubridate::hour(encounter_time), 
    lubridate::minute(encounter_time), 
    lubridate::second(encounter_time)))

  camera_id organism  start_datetime      end_datetime        encounter_time
                                                 
1         1 bird      2021-05-11 05:32:34 2021-05-11 05:32:34 0:00:00       
2         1 mouse     2021-05-11 05:23:46 2021-05-11 05:23:50 0:00:04       
3         1 squirrel  2021-05-15 09:35:20 2021-05-15 09:35:35 0:00:15       
4         2 mouse     2021-05-10 23:11:16 2021-05-10 23:11:17 0:00:01       
5         2 woodchuck 2021-05-12 11:06:08 2021-05-12 11:24:10 0:18:02       
6         3 mouse     2021-05-12 02:04:01 2021-05-12 02:04:03 0:00:02       
7         3 mouse     2021-05-13 01:15:00 2021-05-13 01:15:50 0:00:50

However you end up with encounter_time stored as character, so that may or may not be useful

How to aggregate data based on consecutive row values?

Answers (1)

Related Questions