Reputation: 1300
I want to identify unique sets of row values in a column based on row values in another column to ultimately create a new column in the dataframe. The following picture illustrates my problem and the expected result (i.e., the expected_outcome
column).
For example:
The first 3 rows have values T1
in the column trial
, and values D1, D2, D3
in the column group
.
The next 3 rows have values T3
in the column trial
, and values D3, D2, D1
in the column group
.
Because the set D1, D2, D3
has the same contain as D3, D2, D1
, I want all the 6 rows to have the same value in column expected_outcome
.
My data is way more complex than that. I may have to make this grouping over more than 2 columns. So, I prefer a generic solution to this problem. Below is the data in the picture.
test_data <- data.frame(
trial = c("T1", "T1", "T1", "T3", "T3", "T3", "T5", "T5", "T6", "T6", "T6"),
group = c("D1", "D2", "D3", "D3", "D2", "D1", "D1", "D3", "D1", "D3", "D2")
)
Upvotes: 3
Views: 171
Reputation: 16856
You could do something like this via tidyverse
.
library(tidyverse)
test_data %>%
group_by(trial) %>%
summarize(type = paste(sort(unique(group)), collapse = ", "), group) %>%
group_by(type) %>%
mutate(expected_outcome = cur_group_id()) %>%
ungroup() %>%
dplyr::select(-"type")
Output
# A tibble: 11 × 3
trial group expected_outcome
<chr> <chr> <int>
1 T1 D1 1
2 T1 D2 1
3 T1 D3 1
4 T3 D3 1
5 T3 D2 1
6 T3 D1 1
7 T5 D1 2
8 T5 D3 2
9 T6 D1 1
10 T6 D3 1
11 T6 D2 1
Data
test_data <-
structure(list(
trial = c("T1", "T1", "T1", "T3", "T3", "T3",
"T5", "T5", "T6", "T6", "T6"),
group = c("D1", "D2", "D3", "D3",
"D2", "D1", "D1", "D3", "D1", "D3", "D2")
),
class = "data.frame",
row.names = c(NA,-11L))
Upvotes: 2
Reputation: 72899
Note: This answer is based on the original problem before OP was edited, where groups had to be identified by themselves without an erroneous 'trial'
column.
factor
ize group column and, using ave
calculate cumsum
of duplicated
s in each group. Then, where even numbers appear, cumsum
the differences that equal 1
.
(test_data <- within(test_data, {
group_int <- as.integer(as.factor(group))
group_2 <- ave(group_int, group_int, FUN=\(x)
cumsum(duplicated(x)))
expected_outcome <- cumsum(c(1, diff(group_2 %% 2 == 0)) == 1)
rm(group_int, group_2)
}))
# trial group expected_outcome
# 1 T1 D1 1
# 2 T1 D2 1
# 3 T1 D3 1
# 4 T3 D3 1
# 5 T3 D2 1
# 6 T3 D1 1
# 7 T5 D1 2
# 8 T5 D3 2
# 9 T6 D1 2
# 10 T6 D3 2
# 11 T6 D2 3
Note: R >= 4.1
Data:
test_data <- structure(list(trial = c("T1", "T1", "T1", "T3", "T3", "T3",
"T5", "T5", "T6", "T6", "T6"), group = c("D1", "D2", "D3", "D3",
"D2", "D1", "D1", "D3", "D1", "D3", "D2"), expected_outcome = c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L)), row.names = c(NA, -11L
), class = "data.frame")
Upvotes: 2
Reputation: 78927
I think the last trial
in your column should be T7
?
library(dplyr)
test_data %>%
arrange(across(everything())) %>%
group_by(trial) %>%
mutate(expected_outcome = toString(group)) %>%
group_by(expected_outcome) %>%
mutate(expected_outcome = cur_group_id())
trial group expected_outcome
<chr> <chr> <int>
1 T1 D1 1
2 T1 D2 1
3 T1 D3 1
4 T3 D1 1
5 T3 D2 1
6 T3 D3 1
7 T5 D1 2
8 T5 D3 2
9 T6 D1 2
10 T6 D3 2
11 T7 D2 3
data:
test_data <- structure(list(trial = c("T1", "T1", "T1", "T3", "T3", "T3",
"T5", "T5", "T6", "T6", "T7"), group = c("D1", "D2", "D3", "D3",
"D2", "D1", "D1", "D3", "D1", "D3", "D2")), class = "data.frame", row.names = c(NA,
-11L))
Upvotes: 3