dplyr how to identify two columns having same value (date and hour)

Question

I am working with a large dataset (>15000 rows) that consist on identification of individual animals observed within 30 minute blocks of time in three different groups (GroupA, GroupB, GroupC).

Each row of observation has the Date (DD/MM/YYYY) and TimeStart (in blocks of 30 minutes> starting at 6am, 6:30am, 7am etc.) and the IndividualsObserved (in GroupA: IndA1, IndA2, IndA3, IndA4... IndA12. In GroupB: same, from IndB1 to IndB12, and in GroupC the same, from IndC1 to IndC12). Not all individuals have to be observed in any one observation.

So the data looks like this:

Date            Group        TimeStart    IndividualsObserved
01/01/2018      GroupA         6:00       IndA1,IndA2,IndA3,IndA8
01/01/2018      GroupB;C       6:00       IndB1,IndB4,IndB7,IndC2,IndC5,IndC10
01/01/2018      GroupB         6:00       IndB2,IndB3,IndB5,IndB6,IndB8,IndB9

I am trying to find a systematic way in which some individuals of a given group may be registered in two separated observations at the same Date and TimeStart because sometimes groups can mix (i.e. when GroupA is with GroupB= GroupA;B). If the groups mix, we could have GroupB mixed with GroupC, but some individuals of GroupB may not be mixed (with other groups).

I thought, that maybe it'd be possible to use dplyr code to ask cases when the total list of individuals for any of the three groups group GroupA, GroupB, and GroupC appears in different rows, as would be the case for the second and third row I provide for the case of GroupB.

In that example, the individuals of GroupB (not all, but most) appear in two rows (but it could be more than 2) having same Date and same TimeStart containing individuals of a given group. Would anyone know how to create a column (MatchedSimultaneousEntries) as I suggest in example below? Basically this column would identify as Yes when the individuals of a group may appear in two separate rows that share Date and TimeStart, and No when that was not the case.

Following that:

If GroupA individuals are: IndA1 to IndA12, If GroupB individuals are: IndB1 to IndB12, and if GroupC individuals are: IndC1 to IndC12,

Then:

Date         Group      TimeStart  IndividualsObserved                          MatchedSimultaneousEntries                                        
01/01/2018   GroupA     6:00       IndA1,IndA2,IndA3,IndA8                         No 
01/01/2018   GroupB;C   6:00       IndB1,IndB4,IndB7,IndB10, IndC2,IndC5,IndC10    Yes
01/01/2018   GroupB     6:00       IndB2,IndB3,IndB5,IndB6,IndB8,IndB9             Yes

The Yes in the new column would be assigned becasue the individuals of GroupB are present in two rows that have same Date and TimeStart.

The names I provide are actual names like Fitz, Armstrong or Bowie, I just provided this for simplicity. I would need to use the actual names in a list to identify whether some of the names of a group are in a single row or in more than one row that share Date and TimeStart.

Thanks for your help!

I tried case_when and mutate, but I did not manage

r2evans · Accepted Answer

Here's a start of some code that pivots/unnests, compares, and then re-compresses. It doesn't find "yes" in the third row, not sure where the logic is missing.

Mostly posting now as a way to start/refine the logic.

library(dplyr)
quux %>%
  mutate(rn = row_number(), across(c(Group, IndividualsObserved), ~ strsplit(.x, "[;,]"))) %>%
  unnest(IndividualsObserved) %>%
  unnest(Group) %>%
  mutate(Group = sub("^Group", "", Group)) %>%
  mutate(
    MatchedSimultaneousEntries = n_distinct(Group) > 1,
    .by = c(Date, TimeStart, IndividualsObserved)
  ) %>%
  summarize(
    across(c(Group, IndividualsObserved), ~ paste(unique(.x), collapse = ";")),
    MatchedSimultaneousEntries = any(MatchedSimultaneousEntries),
    .by = c(rn, Date, TimeStart)
  )
# # A tibble: 3 × 6
#      rn Date       TimeStart Group IndividualsObserved                  MatchedSimultaneousEntries
#                                                                     
# 1     1 01/01/2018 6:00      A     IndA1;IndA2;IndA3;IndA8              FALSE                     
# 2     2 01/01/2018 6:00      B;C   IndB1;IndB4;IndB7;IndC2;IndC5;IndC10 TRUE                      
# 3     3 01/01/2018 6:00      B     IndB2;IndB3;IndB5;IndB6;IndB8;IndB9  FALSE

Notes:

I added (and, for demonstration purposes, did not remove) rn so that the double-unnest would be perfectly reversible. Without it, we would rely on properties of the data that do not currently exist easily.
I chose to remove the leading "Group" string literal from the same-named column, it enabled grouping by the single letter
I chose to re-collapse both Group and IndividualsObserved using ; instead of ; in one and , in the other ... no issue with doing the different kinds
I re-collapsed them, but frankly depending on OP's intent, it might be much simpler to keep it in the longer format (pre-summarize), and/or as list-columns.

Data

quux <- structure(list(Date = c("01/01/2018", "01/01/2018", "01/01/2018"), Group = c("GroupA", "GroupB;C", "GroupB"), TimeStart = c("6:00", "6:00", "6:00"), IndividualsObserved = c("IndA1,IndA2,IndA3,IndA8", "IndB1,IndB4,IndB7,IndC2,IndC5,IndC10", "IndB2,IndB3,IndB5,IndB6,IndB8,IndB9")), class = "data.frame", row.names = c(NA, -3L))

dplyr how to identify two columns having same value (date and hour)

Answers (1)

Related Questions