Reputation: 39

dplyr mutate with conditional values AND OR to create a group category

I have a dataset with a variable listing the individuals observed, which can take many values. I have observations for a given day on different individuals (Individual_ID).

The individuals belong to groups. For example, Individual_ID("Adele", "Fitz", "Abba") belong to Group=A, and Individual_ID("Noir", "Rouge", "Bleue") belong to Group=B.

In some instances, individuals from different groups get mixed, so we have something like Individual_ID("Adele", "Rouge", "Bleue"), which would represent a mixed group.

I would like to create a variable called GroupingID that can be either GroupA, GroupB, or MixedGrouping. For that, I do not require that all individuals of a group are present, only that the individuals present either all come from the same group (neat) or come from different groups (not neat).

In order to consider a mixed grouping, any combination involving at least two individuals from different groups is sufficient.

Could someone explain how I could apply an AND/OR condition in mutate to create the GroupingID variable?

Here is what my data looks like:

Date      IndividualsObserved    
1/1/2016   Abba,Adele
2/1/2016   Adele,Fitz
3/1/2016   Fitz,Rouge,Noir
4/1/2016   Fitz,Adele,Abba
5/1/2016   Rouge,Noir,Bleue
6/1/2016   Rouge,Abba,Fitz

(the different individuals appear separated by commas in each entry cell of the column IndividualsObserved)

So I would like a grouping category that discerns whether the grouping is neat (only one group identity) or composed of a mix of individuals from different groups. It would be something like this (GroupingID):

Date      IndividualsObserved   GroupingID
1/1/2016   Abba,Adele           GroupA
2/1/2016   Adele,Fitz           GroupA
3/1/2016   Fitz,Rouge,Noir      MixedGrouping
4/1/2016   Fitz,Adele,Abba      GroupA
5/1/2016   Rouge,Noir,Bleue     GroupB
6/1/2016   Rouge,Abba,Fitz      MixedGrouping
7/1/2016   Noir,Bleue,Abba      MixedGrouping

I tried this, but it did not work:

  mutate(GroupingID = case_when(IndividualsObserved %in% c("Adele","Abba", "Fitz") ~ "GroupA",
                                IndividualsObserved %in% c("Noir","Bleue", "Rouge") ~ "GroupB",
                                TRUE ~ ToCheck))

I would appreciate any insights on how to approach this using dplyr's mutate.

Update:

Many thanks Mark, r2evans, and hello_friend for your helpful suggestions. Indeed, it works in the different ways you propose!

Now that I have applied this to my extensive dataset, I realise I have a few challenging cases. Perhaps you have some ideas about how to:

- consider specific individuals as "ambiguous", meaning they do not belong to any group: they are outsiders visiting both groups, so they cannot be counted as group A or B. Would it be possible to give these individuals a neutral status, so that their presence or absence does not change the overall group classification (including MixedGroup)?

- create an additional column, GroupDetails, that can be GroupA, GroupB, or GroupA+GroupB, according to the list of individuals provided

- finally, because the list has some 30000 entries, is there an R function that returns all the distinct names in IndividualsObserved (the complete list is more extensive than the example I provided)?

Thanks a lot

Upvotes: 1

Answers (4)

MIGUEL

Reputation: 39

Thanks all for sharing your insights,

I realised some individuals do not belong to any group, because we define group membership only once individuals are established and have stopped migrating in and out of groups.

I would like to know how to consider some individuals as "migratory or undecided" so they have a neutral status that does not affect the original binary classification of:

     i) Group A or Group B, and 
     ii) MixedGroup.

I complement the data example here (see the entries for 7, 8, and 9 January 2016):

     Date      IndividualsObserved
     1/1/2016   Abba,Adele
     2/1/2016   Adele,Fitz
     3/1/2016   Fitz,Rouge,Noir
     4/1/2016   Fitz,Adele,Abba
     5/1/2016   Rouge,Noir,Bleue
     6/1/2016   Rouge,Abba,Fitz
     7/1/2016   Rouge,Abba,Guacamole
     8/1/2016   Fitz,Rouge,Saphir
     9/1/2016   Abba,Adele,Dylan

Where the groups would be defined as:

    groups <- list("A" = c("Adele","Fitz","Abba"),
                   "B" = c("Rouge","Noir","Bleue"),
                   "Neutral" = c("Guacamole","Saphir","Dylan"))

So, the presence of "Neutral" individuals does not affect the categorisation of the collective into GroupA, GroupB, or MixedGrouping. Accordingly, these examples should be attributed to the category as follows.

      7/1/2016: Rouge,Abba,Guacamole
      would be MixedGrouping (Rouge and Abba are from different groups; Guacamole is neutral)
      8/1/2016: Fitz,Rouge,Saphir
      would be MixedGrouping (Rouge and Fitz are from different groups; Saphir is neutral)
      9/1/2016: Abba,Adele,Dylan
      would be GroupA (Abba and Adele are from the same group; Dylan is neutral)
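
The rule described above can be sketched in base R roughly as follows. This is only one possible approach: the helper name classify_grouping and the small obs data frame are made up for illustration, and the choice to return NA when only neutral (or unknown) individuals are present is an assumption, since that case is not specified above. The GroupDetails column follows the GroupA / GroupB / GroupA+GroupB labels requested in the question update.

```r
# Group membership as listed above; "Neutral" names never affect the result.
groups <- list(
  A = c("Adele", "Fitz", "Abba"),
  B = c("Rouge", "Noir", "Bleue"),
  Neutral = c("Guacamole", "Saphir", "Dylan")
)

# Hypothetical helper: classify one comma-separated string of names.
classify_grouping <- function(x, groups) {
  ids <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  has_a <- any(ids %in% groups$A)
  has_b <- any(ids %in% groups$B)
  if (has_a && has_b) {
    "MixedGrouping"
  } else if (has_a) {
    "GroupA"
  } else if (has_b) {
    "GroupB"
  } else {
    NA_character_  # assumption: only neutral or unknown individuals present
  }
}

# The three new example rows from this post.
obs <- data.frame(
  Date = c("7/1/2016", "8/1/2016", "9/1/2016"),
  IndividualsObserved = c("Rouge,Abba,Guacamole",
                          "Fitz,Rouge,Saphir",
                          "Abba,Adele,Dylan"),
  stringsAsFactors = FALSE
)

obs$GroupingID <- vapply(obs$IndividualsObserved, classify_grouping,
                         character(1), groups = groups, USE.NAMES = FALSE)

# GroupDetails spells out which groups are involved in a mixed grouping.
obs$GroupDetails <- ifelse(obs$GroupingID == "MixedGrouping",
                           "GroupA+GroupB", obs$GroupingID)
print(obs)
```

Because neutral names are simply never tested, adding or removing them from a row leaves GroupingID unchanged.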

This extends to computing the sex ratio for "neat/homogeneous" group compositions, that is, for HomogeneousGrouping (GroupA or GroupB) versus MixedGrouping.

I have been trying to compute this in my data, which has more than 30000 entries, but I have not found a method yet. Suppose we have a data file with the sex information:

     Individual    GroupingID   Sex
     Adele           GroupA       F
     Abba            GroupA       F
     Fitz            GroupA       F
     Rouge           GroupB       M
     Noir            GroupB       M
     Bleue           GroupB       F
     Saphir          Neutral      F
     Guacamole       Neutral      M
     Dylan           Neutral      M

Which approach could compute the sex ratio for each observation, counting all individuals from any GroupingID (including the neutral ones), into new columns? The sex ratio would be the total number of females divided by the total number of males. Having two columns would be ideal, as my ultimate interest is to compare the sex ratios between grouping styles (HomogeneousGrouping vs MixedGrouping).

        1st column: HomogeneousGroupingSexRatio (only one GroupingID: A or B)
        2nd column: MixedGroupingSexRatio (more than one GroupingID: A+B)
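
As a rough sketch of how the two columns described above might be filled, the base R code below looks up each individual's sex and group from the table, computes females divided by males per row, and writes the ratio into whichever column matches the grouping style. The names sex_info, obs, sex_lookup and group_lookup are illustrative, the three example rows are taken from the data above, and leaving the non-applicable column as NA is an assumption. Note that in R a row with no males gives Inf (or NaN if it also has no females).

```r
# Sex and group table as given above.
sex_info <- data.frame(
  Individual = c("Adele", "Abba", "Fitz", "Rouge", "Noir", "Bleue",
                 "Saphir", "Guacamole", "Dylan"),
  GroupingID = c("GroupA", "GroupA", "GroupA", "GroupB", "GroupB", "GroupB",
                 "Neutral", "Neutral", "Neutral"),
  Sex = c("F", "F", "F", "M", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Three example rows from the data above.
obs <- data.frame(
  Date = c("5/1/2016", "7/1/2016", "9/1/2016"),
  IndividualsObserved = c("Rouge,Noir,Bleue",
                          "Rouge,Abba,Guacamole",
                          "Abba,Adele,Dylan"),
  stringsAsFactors = FALSE
)

# Named vectors used as lookup tables: name -> sex, name -> group.
sex_lookup <- setNames(sex_info$Sex, sex_info$Individual)
group_lookup <- setNames(sex_info$GroupingID, sex_info$Individual)

split_ids <- lapply(strsplit(obs$IndividualsObserved, ",", fixed = TRUE), trimws)

# Females / males per row, counting everyone (neutral individuals included).
sex_ratio <- vapply(split_ids, function(ids) {
  sexes <- sex_lookup[ids]
  sum(sexes == "F", na.rm = TRUE) / sum(sexes == "M", na.rm = TRUE)
}, numeric(1))

# Number of distinct non-neutral groups per row.
n_groups <- vapply(split_ids, function(ids) {
  length(setdiff(unique(group_lookup[ids]), c("Neutral", NA)))
}, integer(1))

obs$HomogeneousGroupingSexRatio <- ifelse(n_groups == 1, sex_ratio, NA_real_)
obs$MixedGroupingSexRatio <- ifelse(n_groups > 1, sex_ratio, NA_real_)
print(obs)
```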

Thanks a lot for sharing your thoughts!

Upvotes: 0

hello_friend

Reputation: 5798

Base R Solution:

# Resolve the values to classify into distinct groups;
# map_from => character vector
map_from <- c("Adele", "Fitz", "Abba", "Rouge", "Noir", "Bleue")

# Resolve the groups for each value specified above: 
# map_to => character vector
map_to <- c("A", "A", "A", "B", "B", "B")

# Resolve the values to map: 
# value_map => named character vector
value_map <- setNames(map_to, map_from)

# Resolve the group: GroupingID => character vector
df$GroupingID <- vapply(
  # For each value in the IndividualsObserved vector: 
  df$IndividualsObserved,
  function(x){
    # For each element in the list: 
    ir <- lapply(
      # Split the string into a list: 
      strsplit(x, ","),
      function(y){
        # Look up each name's group in value_map: 
        # character vector of names => character vector of groups
        value_map[y]
      }
    )
    # Unlist the list into a vector: 
    # unlisted_ir => character vector: 
    unlisted_ir <- unlist(ir)
    # Resolve the number of unique values: 
    # n_unique => integer scalar
    n_unique <- length(unique(unlisted_ir))
    # If there is a single group: 
    if(n_unique == 1){
      # return that group's label: character scalar
      unlisted_ir[1]
    }else{
      # otherwise return the default label: character scalar
      "Mixed Group"
    }
  },
  # Explicitly define a character vector of length one 
  # is returned: 
  character(1),
  # Ensure the names of the character vector aren't used:
  USE.NAMES = FALSE
)

Input Data:

# Resolve the input data.frame: 
# df => data.frame
df <- read.table(
  text = "Date      IndividualsObserved
    1/1/2016   Abba,Adele
    2/1/2016   Adele,Fitz
    3/1/2016   Fitz,Rouge,Noir
    4/1/2016   Fitz,Adele,Abba
    5/1/2016   Rouge,Noir,Bleue
    6/1/2016   Rouge,Abba,Fitz
    7/1/2016   Noir,Bleue,Abba", 
  header = TRUE
)
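
Because value_map only knows the names in map_from, any name in the data that is missing from it is looked up as NA and counted as an extra value by unique(). The sketch below, on a hypothetical four-row example (obs) that includes the neutral name Guacamole mentioned elsewhere in this thread, lists every distinct name in the column and the ones that have no mapping yet:

```r
# Hypothetical example rows; "Guacamole" is deliberately not in map_from.
obs <- data.frame(
  Date = c("1/1/2016", "3/1/2016", "5/1/2016", "7/1/2016"),
  IndividualsObserved = c("Abba,Adele", "Fitz,Rouge,Noir",
                          "Rouge,Noir,Bleue", "Rouge,Abba,Guacamole"),
  stringsAsFactors = FALSE
)
map_from <- c("Adele", "Fitz", "Abba", "Rouge", "Noir", "Bleue")

# Split every cell on commas, flatten, trim stray spaces, keep distinct names.
all_names <- sort(unique(trimws(unlist(
  strsplit(obs$IndividualsObserved, ",", fixed = TRUE)
))))

# Names that appear in the data but have no group assigned yet.
unmapped <- setdiff(all_names, map_from)

print(all_names)
print(unmapped)
```

The same two lines should work unchanged on a column with tens of thousands of rows, since strsplit() and unique() operate on the whole vector at once.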

Upvotes: 1

r2evans

Reputation: 160687

Similar to Mark's answer, but after creating a list-column, we can look for all(.. %in% ..) membership to define the groups.

library(dplyr)

quux %>%
  mutate(IndividualsObserved = strsplit(IndividualsObserved, ",")) %>%
  rowwise() %>%
  mutate(
    GroupingID = case_when(
      all(IndividualsObserved %in% c("Adele","Abba", "Fitz")) ~ "GroupA", 
      all(IndividualsObserved %in% c("Noir","Bleue", "Rouge")) ~ "GroupB", 
      TRUE ~ "MixedGroup")
  ) %>%
  ungroup()
# # A tibble: 6 × 3
#   Date     IndividualsObserved GroupingID
#   <chr>    <list>              <chr>     
# 1 1/1/2016 <chr [2]>           GroupA    
# 2 2/1/2016 <chr [2]>           GroupA    
# 3 3/1/2016 <chr [3]>           MixedGroup
# 4 4/1/2016 <chr [3]>           GroupA    
# 5 5/1/2016 <chr [3]>           GroupB    
# 6 6/1/2016 <chr [3]>           MixedGroup

I'm generally not a fan of doing things rowwise(), but it works well enough here and is unlikely to be a performance problem unless your real data is fairly large.
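
If rowwise() does turn out to be slow, the same all(.. %in% ..) test can be written without it. The following is a base R sketch of that idea, not a benchmarked replacement; the vector names group_a, group_b, all_a and all_b are illustrative, and whether it is actually faster on real data would need measuring.

```r
quux <- data.frame(
  Date = c("1/1/2016", "2/1/2016", "3/1/2016",
           "4/1/2016", "5/1/2016", "6/1/2016"),
  IndividualsObserved = c("Abba,Adele", "Adele,Fitz", "Fitz,Rouge,Noir",
                          "Fitz,Adele,Abba", "Rouge,Noir,Bleue",
                          "Rouge,Abba,Fitz"),
  stringsAsFactors = FALSE
)

group_a <- c("Adele", "Abba", "Fitz")
group_b <- c("Noir", "Bleue", "Rouge")

# One character vector of names per row.
ids <- strsplit(quux$IndividualsObserved, ",", fixed = TRUE)

# One logical per row: are all names in group A / in group B?
all_a <- vapply(ids, function(x) all(x %in% group_a), logical(1))
all_b <- vapply(ids, function(x) all(x %in% group_b), logical(1))

# Same precedence as the case_when() above: A first, then B, else mixed.
quux$GroupingID <- ifelse(all_a, "GroupA",
                          ifelse(all_b, "GroupB", "MixedGroup"))
print(quux)
```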

Data

quux <- structure(list(Date = c("1/1/2016", "2/1/2016", "3/1/2016", "4/1/2016", "5/1/2016", "6/1/2016"), IndividualsObserved = c("Abba,Adele", "Adele,Fitz", "Fitz,Rouge,Noir", "Fitz,Adele,Abba", "Rouge,Noir,Bleue", "Rouge,Abba,Fitz")), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 1

Mark

Reputation: 12558

Steps:

1. Create a named list for the groups
2. Split each Individuals row into a list, giving us a list column
3. For each row in the new list column, check if any of the names are in groups A and B. If both, then mixed, A then A, B then B, neither then None.

library(tidyverse)

groups <- list("A" = c("Adele", "Fitz", "Abba"),
               "B" = c("Rouge", "Noir", "Bleue"))

df |>
  mutate(IndividualsObserved = str_split(IndividualsObserved, ","),
         Group = map_chr(IndividualsObserved, \(x) {
            a <- any(x %in% groups$A)
            b <- any(x %in% groups$B)
            case_when(a & b ~ "MixedGrouping",
                      a ~ "GroupA",
                      b ~ "GroupB",
                      TRUE ~ "None")}))

Output:

      Date IndividualsObserved         Group
1 1/1/2016         Abba, Adele        GroupA
2 2/1/2016         Adele, Fitz        GroupA
3 3/1/2016   Fitz, Rouge, Noir MixedGrouping
4 4/1/2016   Fitz, Adele, Abba        GroupA
5 5/1/2016  Rouge, Noir, Bleue        GroupB
6 6/1/2016   Rouge, Abba, Fitz MixedGrouping
7 7/1/2016   Noir, Bleue, Abba MixedGrouping

There are many other ways you could do this, e.g. making a dataframe of groups with their corresponding individuals, separating each individual in df into its own row, and doing a join, to give but one way, but this is the most straightforward in my opinion.
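
For completeness, here is a rough sketch of that join-based alternative. It is written in base R so it runs without extra packages; in the tidyverse the same steps would presumably be tidyr::separate_rows() followed by dplyr::left_join(). The names membership, long, joined and per_row are illustrative, and a row id is used as the key so that repeated dates would not be merged together.

```r
df <- data.frame(
  Date = c("1/1/2016", "2/1/2016", "3/1/2016", "4/1/2016",
           "5/1/2016", "6/1/2016", "7/1/2016"),
  IndividualsObserved = c("Abba,Adele", "Adele,Fitz", "Fitz,Rouge,Noir",
                          "Fitz,Adele,Abba", "Rouge,Noir,Bleue",
                          "Rouge,Abba,Fitz", "Noir,Bleue,Abba"),
  stringsAsFactors = FALSE
)

# Lookup table: one row per individual with its group.
membership <- data.frame(
  Individual = c("Adele", "Fitz", "Abba", "Rouge", "Noir", "Bleue"),
  Group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Long format: one row per (original row, individual).
ids <- strsplit(df$IndividualsObserved, ",", fixed = TRUE)
long <- data.frame(
  row = rep(seq_len(nrow(df)), lengths(ids)),
  Individual = trimws(unlist(ids)),
  stringsAsFactors = FALSE
)

# Left join: individuals without a known group get Group = NA.
joined <- merge(long, membership, by = "Individual", all.x = TRUE)

# Collapse back to one label per original row.
per_row <- vapply(split(joined$Group, joined$row), function(g) {
  g <- unique(g[!is.na(g)])
  if (length(g) > 1) {
    "MixedGrouping"
  } else if (length(g) == 1) {
    paste0("Group", g)
  } else {
    "None"
  }
}, character(1))

df$Group <- unname(per_row[as.character(seq_len(nrow(df)))])
print(df)
```

One advantage of this layout is that adding a new group, or a new individual, only means adding rows to the lookup table.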

Data:

df <- read.table(text= 
"Date      IndividualsObserved
1/1/2016   Abba,Adele
2/1/2016   Adele,Fitz
3/1/2016   Fitz,Rouge,Noir
4/1/2016   Fitz,Adele,Abba
5/1/2016   Rouge,Noir,Bleue
6/1/2016   Rouge,Abba,Fitz
7/1/2016   Noir,Bleue,Abba", header = TRUE)

Upvotes: 4
