Reputation: 297
I would like to merge rows in a dataframe if they have at least one word in common and have the same value for 'code'. The column to be searched for matching words is "name". Here's an example dataset:
df <- data.frame(
id = 1:8,
name = c("tiger ltd", "tiger cpy", "tiger", "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c(rep("4564AB", 3), rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The approach that I envision would look something like this:
The final dataset would look like this:
final_df <- data.frame(
id = c("1|2|3", 4:8),
name = c(paste(c("tiger ltd", "tiger cpy", "tiger"), collapse = "|"), "rhino", "hippo", "elephant", "elephant bros", "last comp"),
code = c("4564AB", rep("7845BC", 2), "6144DE", "7845KI", "7845EG")
)
The first three rows have the common word 'tiger' and the same code. Therefore they are merged into a single row with the different values separated by "|". The other rows are not merged because they either do not have a word in common or do not have the same code.
Upvotes: 1
Views: 393
Reputation: 887951
We could have a condition with if/else
after grouping. Extract the words from the 'name' column and check for any intersect
ing elements, create a flag
where the length
of intersecting elements are greater than 0 and the group size (n()
) is greater than 1 and use this to paste/str_c
elements of the other columns
library(dplyr)
library(stringr)
library(purrr)
library(magrittr)
df %>%
group_by(code = factor(code, levels = unique(code))) %>%
mutate(flag = n() > 1 &
(str_extract_all(name, "\\w+") %>%
reduce(intersect) %>%
length %>%
is_greater_than(0))) %>%
summarise(across(-flag, ~ if(any(flag))
str_c(.x, collapse = "|") else as.character(.x)), .groups = 'drop') %>%
select(names(df))
-output
# A tibble: 6 × 3
id name code
<chr> <chr> <fct>
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
-OP's expected
> final_df
id name code
1 1|2|3 tiger ltd|tiger cpy|tiger 4564AB
2 4 rhino 7845BC
3 5 hippo 7845BC
4 6 elephant 6144DE
5 7 elephant bros 7845KI
6 8 last comp 7845EG
Upvotes: 1
Reputation: 24867
You can use this helper function f()
, and apply it to each group:
f <- function(d) {
if(length(Reduce(intersect,strsplit(d[["name"]]," ")))>0) {
d = lapply(d,paste0,collapse="|")
}
return(d)
}
library(data.table)
setDT(df)[,id:=as.character(id)][, f(.SD),code]
Output:
code id name
<char> <char> <char>
1: 4564AB 1|2|3 tiger ltd|tiger cpy|tiger
2: 7845BC 4 rhino
3: 7845BC 5 hippo
4: 6144DE 6 elephant
5: 7845KI 7 elephant bros
6: 7845EG 8 last comp
Upvotes: 1