Reputation: 2318
I have a dataframe that I want to group based on the values in a column.
The trick is though there are some rows that need be merged if the values have already appeared in a column that I am grouping by.
For example:
df <- data.frame(col1 = c("R1", "R2", "R2", "R2", "R2", "R4", "R5", "R5", "R5"),
col2 = c("R10", "R4", "R5", "R6", "R7", "R5", "R6", "R7", "R9"), stringsAsFactors = FALSE)
df2 <- aggregate(col2 ~ col1, df, FUN = function(x) paste(unique(x), collapse = ", "))
> df
col1 col2
1 R1 R10
2 R2 R4
3 R2 R5
4 R2 R6
5 R2 R7
6 R4 R5
7 R5 R6
8 R5 R7
9 R5 R9
> df2
col1 col2
1 R1 R10
2 R2 R4, R5, R6, R7
3 R4 R5
4 R5 R6, R7, R9
R10 will be in group R1 (row 1)
R4, R5, R6 and R7 will be in group R2 (rows 2 to 5) R5 will be in group R4 (row 6)
R6, R7 and R9 will be in group R5 (rows: 7 to 9)
But R4 and R5 are already in R2 so this will stay in R2. For R9 which is originally assigned to R5, needs to be grouped in to R2.
So the desired outcome will be:
> df3
col1 col2
1 R1 R10
2 R2 R4, R5, R6, R7, R9
OR preferably:
1 col1 col2
2 R1 R10
3 R2 R4
4 R2 R5
5 R2 R6
6 R2 R7
7 R2 R9
Upvotes: 2
Views: 769
Reputation: 887891
An option would be to replace
the values based on the intersect
ing elements and then do the aggregate
i1 <- df$col1 %in% df$col2
df$col1[i1] <- df$col1[match(df$col1[inds], df$col2)]
aggregate(col2 ~ col1, unique(df), FUN = toString)
# col1 col2
#1 R1 R10
#2 R2 R4, R5, R6, R7, R9
Or with tidyverse
library(dplyr)
library(stringr)
df %>%
group_by(col1 = case_when(col1 %in% intersect(col1, col2) ~ "R2",
TRUE ~ col1)) %>%
distinct %>%
summarise(col2 = toString(col2))
# A tibble: 2 x 2
# col1 col2
# <chr> <chr>
#1 R1 R10
#2 R2 R4, R5, R6, R7, R9
Upvotes: 3
Reputation: 389275
One option with base R can be
inds <- df$col1 %in% df$col2
df$col1[inds] <- df$col1[match(df$col1[inds], df$col2)]
and then we can take only unique
values of dataframe
unique(df)
# col1 col2
#1 R1 R10
#2 R2 R4
#3 R2 R5
#4 R2 R6
#5 R2 R7
#9 R2 R9
Or if you want comma-separated string
aggregate(col2 ~ col1, unique(df), toString)
# col1 col2
#1 R1 R10
#2 R2 R4, R5, R6, R7, R9
Upvotes: 1