Reputation: 39
I am working with a large dataset (>15000 rows) that consist on identification of individual animals observed within 30 minute blocks of time in three different groups (GroupA, GroupB, GroupC).
Each row of observation has the Date (DD/MM/YYYY) and TimeStart (in blocks of 30 minutes> starting at 6am, 6:30am, 7am etc.) and the IndividualsObserved (in GroupA: IndA1, IndA2, IndA3, IndA4... IndA12. In GroupB: same, from IndB1 to IndB12, and in GroupC the same, from IndC1 to IndC12). Not all individuals have to be observed in any one observation.
So the data looks like this:
Date Group TimeStart IndividualsObserved
01/01/2018 GroupA 6:00 IndA1,IndA2,IndA3,IndA8
01/01/2018 GroupB;C 6:00 IndB1,IndB4,IndB7,IndC2,IndC5,IndC10
01/01/2018 GroupB 6:00 IndB2,IndB3,IndB5,IndB6,IndB8,IndB9
I am trying to find a systematic way in which some individuals of a given group may be registered in two separated observations at the same Date and TimeStart because sometimes groups can mix (i.e. when GroupA is with GroupB= GroupA;B). If the groups mix, we could have GroupB mixed with GroupC, but some individuals of GroupB may not be mixed (with other groups).
I thought, that maybe it'd be possible to use dplyr code to ask cases when the total list of individuals for any of the three groups group GroupA, GroupB, and GroupC appears in different rows, as would be the case for the second and third row I provide for the case of GroupB.
In that example, the individuals of GroupB (not all, but most) appear in two rows (but it could be more than 2) having same Date and same TimeStart containing individuals of a given group. Would anyone know how to create a column (MatchedSimultaneousEntries) as I suggest in example below? Basically this column would identify as Yes when the individuals of a group may appear in two separate rows that share Date and TimeStart, and No when that was not the case.
Following that:
If GroupA individuals are: IndA1 to IndA12, If GroupB individuals are: IndB1 to IndB12, and if GroupC individuals are: IndC1 to IndC12,
Then:
Date Group TimeStart IndividualsObserved MatchedSimultaneousEntries
01/01/2018 GroupA 6:00 IndA1,IndA2,IndA3,IndA8 No
01/01/2018 GroupB;C 6:00 IndB1,IndB4,IndB7,IndB10, IndC2,IndC5,IndC10 Yes
01/01/2018 GroupB 6:00 IndB2,IndB3,IndB5,IndB6,IndB8,IndB9 Yes
The Yes in the new column would be assigned becasue the individuals of GroupB are present in two rows that have same Date and TimeStart.
The names I provide are actual names like Fitz, Armstrong or Bowie, I just provided this for simplicity. I would need to use the actual names in a list to identify whether some of the names of a group are in a single row or in more than one row that share Date and TimeStart.
Thanks for your help!
I tried case_when and mutate, but I did not manage
Upvotes: 0
Views: 97
Reputation: 160687
Here's a start of some code that pivots/unnests, compares, and then re-compresses. It doesn't find "yes" in the third row, not sure where the logic is missing.
Mostly posting now as a way to start/refine the logic.
library(dplyr)
quux %>%
mutate(rn = row_number(), across(c(Group, IndividualsObserved), ~ strsplit(.x, "[;,]"))) %>%
unnest(IndividualsObserved) %>%
unnest(Group) %>%
mutate(Group = sub("^Group", "", Group)) %>%
mutate(
MatchedSimultaneousEntries = n_distinct(Group) > 1,
.by = c(Date, TimeStart, IndividualsObserved)
) %>%
summarize(
across(c(Group, IndividualsObserved), ~ paste(unique(.x), collapse = ";")),
MatchedSimultaneousEntries = any(MatchedSimultaneousEntries),
.by = c(rn, Date, TimeStart)
)
# # A tibble: 3 × 6
# rn Date TimeStart Group IndividualsObserved MatchedSimultaneousEntries
# <int> <chr> <chr> <chr> <chr> <lgl>
# 1 1 01/01/2018 6:00 A IndA1;IndA2;IndA3;IndA8 FALSE
# 2 2 01/01/2018 6:00 B;C IndB1;IndB4;IndB7;IndC2;IndC5;IndC10 TRUE
# 3 3 01/01/2018 6:00 B IndB2;IndB3;IndB5;IndB6;IndB8;IndB9 FALSE
Notes:
rn
so that the double-unnest
would be perfectly reversible. Without it, we would rely on properties of the data that do not currently exist easily."Group"
string literal from the same-named column, it enabled grouping by the single letterGroup
and IndividualsObserved
using ;
instead of ;
in one and ,
in the other ... no issue with doing the different kindssummarize
), and/or as list-columns.Data
quux <- structure(list(Date = c("01/01/2018", "01/01/2018", "01/01/2018"), Group = c("GroupA", "GroupB;C", "GroupB"), TimeStart = c("6:00", "6:00", "6:00"), IndividualsObserved = c("IndA1,IndA2,IndA3,IndA8", "IndB1,IndB4,IndB7,IndC2,IndC5,IndC10", "IndB2,IndB3,IndB5,IndB6,IndB8,IndB9")), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 1