Liman
Liman

Reputation: 1300

Create column based on distinctive sets of multiple row values

I want to identify unique sets of row values in a column based on row values in another column to ultimately create a new column in the dataframe. The following picture illustrates my problem and the expected result (i.e., the expected_outcome column).

For example:

Because the set D1, D2, D3 has the same contain as D3, D2, D1, I want all the 6 rows to have the same value in column expected_outcome.

My data is way more complex than that. I may have to make this grouping over more than 2 columns. So, I prefer a generic solution to this problem. Below is the data in the picture.

test_data <- data.frame(
  trial            = c("T1", "T1", "T1", "T3", "T3", "T3", "T5", "T5", "T6", "T6", "T6"),
  group            = c("D1", "D2", "D3", "D3", "D2", "D1", "D1", "D3", "D1", "D3", "D2")
)

Data and expected result

Upvotes: 3

Views: 171

Answers (3)

AndrewGB
AndrewGB

Reputation: 16856

You could do something like this via tidyverse.

library(tidyverse)

test_data %>%
  group_by(trial) %>%
  summarize(type = paste(sort(unique(group)), collapse = ", "), group) %>%
  group_by(type) %>%
  mutate(expected_outcome = cur_group_id()) %>%
  ungroup() %>%
  dplyr::select(-"type")

Output

# A tibble: 11 × 3
   trial group expected_outcome
   <chr> <chr>            <int>
 1 T1    D1                   1
 2 T1    D2                   1
 3 T1    D3                   1
 4 T3    D3                   1
 5 T3    D2                   1
 6 T3    D1                   1
 7 T5    D1                   2
 8 T5    D3                   2
 9 T6    D1                   1
10 T6    D3                   1
11 T6    D2                   1

Data

test_data <-
  structure(list(
    trial = c("T1", "T1", "T1", "T3", "T3", "T3",
              "T5", "T5", "T6", "T6", "T6"),
    group = c("D1", "D2", "D3", "D3",
              "D2", "D1", "D1", "D3", "D1", "D3", "D2")
  ),
  class = "data.frame",
  row.names = c(NA,-11L))

Upvotes: 2

jay.sf
jay.sf

Reputation: 72899

Note: This answer is based on the original problem before OP was edited, where groups had to be identified by themselves without an erroneous 'trial' column.

factorize group column and, using ave calculate cumsum of duplicateds in each group. Then, where even numbers appear, cumsum the differences that equal 1.

(test_data <- within(test_data, { 
                    group_int <- as.integer(as.factor(group))
                    group_2 <- ave(group_int, group_int, FUN=\(x) 
                                   cumsum(duplicated(x)))
                    expected_outcome <- cumsum(c(1, diff(group_2 %% 2 == 0)) == 1)
                    rm(group_int, group_2)
}))
#    trial group expected_outcome
# 1     T1    D1                1
# 2     T1    D2                1
# 3     T1    D3                1
# 4     T3    D3                1
# 5     T3    D2                1
# 6     T3    D1                1
# 7     T5    D1                2
# 8     T5    D3                2
# 9     T6    D1                2
# 10    T6    D3                2
# 11    T6    D2                3

Note: R >= 4.1


Data:

test_data <- structure(list(trial = c("T1", "T1", "T1", "T3", "T3", "T3", 
"T5", "T5", "T6", "T6", "T6"), group = c("D1", "D2", "D3", "D3", 
"D2", "D1", "D1", "D3", "D1", "D3", "D2"), expected_outcome = c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L)), row.names = c(NA, -11L
), class = "data.frame")

Upvotes: 2

TarJae
TarJae

Reputation: 78927

I think the last trial in your column should be T7?

library(dplyr)

test_data %>% 
  arrange(across(everything())) %>% 
  group_by(trial) %>% 
  mutate(expected_outcome = toString(group)) %>%
  group_by(expected_outcome) %>% 
  mutate(expected_outcome = cur_group_id())
    trial group expected_outcome
   <chr> <chr>            <int>
 1 T1    D1                   1
 2 T1    D2                   1
 3 T1    D3                   1
 4 T3    D1                   1
 5 T3    D2                   1
 6 T3    D3                   1
 7 T5    D1                   2
 8 T5    D3                   2
 9 T6    D1                   2
10 T6    D3                   2
11 T7    D2                   3

data:

test_data <- structure(list(trial = c("T1", "T1", "T1", "T3", "T3", "T3", 
"T5", "T5", "T6", "T6", "T7"), group = c("D1", "D2", "D3", "D3", 
"D2", "D1", "D1", "D3", "D1", "D3", "D2")), class = "data.frame", row.names = c(NA, 
-11L))

Upvotes: 3

Related Questions