How to select specific rows of a DF depending on the time gap between rows and other two conditions related to other variables

Question

I have a data frame df which summarises observations of individuals of a specific animal species. The column DateTime tells you when the animal was seen, the column Observer who saw it, and the column Animal tells you which specific individual (they can be recognised).

df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
               Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
               Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")

df
              DateTime Observer Animal
1  2016-08-01 12:04:07    Peter      1
2  2016-08-01 12:06:07   Sophie      2
3  2016-08-01 12:06:58    Peter      1
4  2016-08-01 13:12:12    Peter      1
5  2016-08-01 14:04:07   Sophie      2
6  2016-08-01 13:12:45   Sophie      1
7  2016-08-01 15:04:07    Peter      2
8  2016-08-01 17:13:16   Sophie      1
9  2016-08-01 17:21:16   Sophie      2
10 2016-08-01 17:21:34   Sophie      1
11 2016-08-01 17:23:42    Peter      1
12 2016-08-01 17:27:16    Peter      2
13 2016-08-01 17:27:22    Peter      2
14 2016-08-01 17:28:01   Andreu      2
15 2016-08-01 17:29:28   Sophie      1
16 2016-08-01 17:28:08     Anna      2
17 2016-08-01 17:28:15    Peter      2

Due to the methodology for counting animals, the same person CAN'T see the same individual in less than 60 seconds, but another person does.

For a specific purpose, I need to create a df in which every time that someone sees a specific individual, I delete rows in the next 60 seconds of observations of OTHER PEOPLE (if the same person sees the same animal in less than 60 s I delete directly the row. We can see this example in rows 12 and 13), but I add the info of those deleted rows in the columns Other_observers, which summarise the number of other people that saw this animal, and Who, which summarises their names.

What I would like to get is this:

df
              DateTime Observer Ind Other_observers         Who
1  2016-08-01 12:04:07    Peter   1               0          NA
2  2016-08-01 12:06:07   Sophie   2               0          NA
3  2016-08-01 12:06:58    Peter   1               0          NA
4  2016-08-01 13:12:12    Peter   1               1      Sophie
5  2016-08-01 14:04:07   Sophie   2               0          NA
6  2016-08-01 15:04:07    Peter   2               0          NA
7  2016-08-01 17:13:16   Sophie   1               0          NA
8  2016-08-01 17:21:16   Sophie   2               0          NA
9  2016-08-01 17:21:34   Sophie   1               0          NA
10 2016-08-01 17:23:42    Peter   1               0          NA
11 2016-08-01 17:27:16    Peter   2               2 Andreu Anna
12 2016-08-01 17:28:15    Peter   2               0          NA
13 2016-08-01 17:29:28   Sophie   1               0          NA

Does anyone know how to do it?

Clemens Hug · Accepted Answer

I thought about it again a little and I think I have a (also much simpler) solution that doesn't have the limitations we discussed. I added some additional observations to check that edge case.

library(tidyverse)

df <- tribble(
  ~DateTime, ~Observer, ~Animal,
  "2016-08-01 12:04:07",   "Peter",       1,
  "2016-08-01 12:06:07",  "Sophie",       2,
  "2016-08-01 12:06:58",   "Peter",       1,
  "2016-08-01 13:12:12",   "Peter",       1,
  "2016-08-01 14:04:07",  "Sophie",       2,
  "2016-08-01 13:12:45",  "Sophie",       1,
  "2016-08-01 15:04:07",   "Peter",       2,
  "2016-08-01 17:13:16",  "Sophie",       1,
  "2016-08-01 17:21:16",  "Sophie",       2,
  "2016-08-01 17:21:34",  "Sophie",       1,
  "2016-08-01 17:23:42",   "Peter",       1,
  "2016-08-01 17:27:16",   "Peter",       2,
  "2016-08-01 17:27:22",   "Peter",       2,
  "2016-08-01 17:28:01",  "Andreu",       2,
  "2016-08-01 17:29:28",  "Sophie",       1,
  "2016-08-01 17:28:08",    "Anna",       2,
  "2016-08-01 17:28:15",   "Peter",       2,
  "2016-08-01 17:28:17",   "Peter",       2,
  "2016-08-01 17:28:21",   "Peter",       2,
  "2016-08-01 17:28:21",   "Anna",        2,
) %>%
  mutate(DateTime = as.POSIXct(DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC"))

min_diff = as.difftime(60, units = c("secs"))

cumsum_reset <- function(s, x, reset) {
  ns <- s + x
  if (ns > reset) return(0)
  ns
}

df_wrangled <- df %>%
  arrange(DateTime) %>%
  group_by(Animal) %>%
  mutate(
    # Time difference to laste observation of this animal
    Diff = replace_na(DateTime - lag(DateTime, 1), 0),
    # Cumulative time since first observation, resets to 0 when more than `min_diff`
    CumDiff = accumulate(Diff, cumsum_reset, reset = min_diff),
    # Group observations within the `min_diff` period
    ObsGroup = cumsum(CumDiff == 0)
  ) %>%
  group_by(ObsGroup, add = TRUE) %>%
  summarize(
    Other_observers = length(unique(Observer)) - 1,
    Who = paste(unique(setdiff(Observer, Observer[1])), collapse = " "),
    DateTime = DateTime[1],
    Observer = Observer[1]
  ) %>%
  ungroup()

print(df_wrangled, n = Inf)
#> # A tibble: 13 x 6
#>    Animal ObsGroup Other_observers Who         DateTime            Observer
#>                                             
#>  1      1        1               0 ""          2016-08-01 12:04:07 Peter   
#>  2      1        2               0 ""          2016-08-01 12:06:58 Peter   
#>  3      1        3               1 Sophie      2016-08-01 13:12:12 Peter   
#>  4      1        4               0 ""          2016-08-01 17:13:16 Sophie  
#>  5      1        5               0 ""          2016-08-01 17:21:34 Sophie  
#>  6      1        6               0 ""          2016-08-01 17:23:42 Peter   
#>  7      1        7               0 ""          2016-08-01 17:29:28 Sophie  
#>  8      2        1               0 ""          2016-08-01 12:06:07 Sophie  
#>  9      2        2               0 ""          2016-08-01 14:04:07 Sophie  
#> 10      2        3               0 ""          2016-08-01 15:04:07 Peter   
#> 11      2        4               0 ""          2016-08-01 17:21:16 Sophie  
#> 12      2        5               2 Andreu Anna 2016-08-01 17:27:16 Peter   
#> 13      2        6               1 Anna        2016-08-01 17:28:17 Peter

^{Created on 2019-04-30 by the reprex package (v0.2.1)}

OLD SOLUTION:

Here is one solution using the excellent fuzzyjoin package. Essentially I'm joining the observations to itself as long as they are less than min_dist apart.

There are some tricky edge cases here that I didn't address. For example, if an observer records an observation for a single animal say every 30 s for 5 min, I believe they would all be filtered out as long as they are <1 min appart, except the first observation. This is probably not what you want, but I'm not sure right now how to address that.

library(tidyverse)
library(fuzzyjoin)
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
               Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
               Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")


min_diff = as.difftime(1, units = c("mins"))

df_wrangled <- df %>%
  as_tibble() %>%
  arrange(DateTime) %>%
  # Add a unique id for each observation
  mutate(id = 1:n()) %>%
  fuzzy_left_join(
    x = .,
    y = .,
    by = c("Animal", "DateTime"),
    match_fun = list(
      `==`,
      function(x, y) y - x < min_diff & y - x > 0
    )
  ) %>%
  # Remove observations that occured within `min_diff`
  filter(!(id.x %in% id.y)) %>%
  # Remove observations by same observer within `min_diff`
  filter(ifelse(is.na(Observer.y), TRUE, Observer.x != Observer.y)) %>%
  group_by(DateTime.x, Observer.x, Animal.x, id.x) %>%
  summarize(
    Other_observers = length(na.omit(Observer.y)),
    Who = paste(Observer.y, collapse = " ")
  ) %>%
  ungroup()

print(df_wrangled, n = Inf)
#> # A tibble: 12 x 6
#>    DateTime.x          Observer.x Animal.x  id.x Other_observers Who       
#>                                             
#>  1 2016-08-01 12:04:07 Peter             1     1               0 NA        
#>  2 2016-08-01 12:06:07 Sophie            2     2               0 NA        
#>  3 2016-08-01 12:06:58 Peter             1     3               0 NA        
#>  4 2016-08-01 13:12:12 Peter             1     4               1 Sophie    
#>  5 2016-08-01 14:04:07 Sophie            2     6               0 NA        
#>  6 2016-08-01 15:04:07 Peter             2     7               0 NA        
#>  7 2016-08-01 17:13:16 Sophie            1     8               0 NA        
#>  8 2016-08-01 17:21:16 Sophie            2     9               0 NA        
#>  9 2016-08-01 17:21:34 Sophie            1    10               0 NA        
#> 10 2016-08-01 17:23:42 Peter             1    11               0 NA        
#> 11 2016-08-01 17:27:16 Peter             2    12               2 Andreu An…
#> 12 2016-08-01 17:29:28 Sophie            1    17               0 NA

^{Created on 2019-04-30 by the reprex package (v0.2.1)}

How to select specific rows of a DF depending on the time gap between rows and other two conditions related to other variables

Answers (1)

Related Questions