Reputation: 2101
I have a dataframe containing two columns: id, and string, which is a text column containing repeated text elements separated by the symbol /
library(tidyverse)
df_input <- data.frame(stringsAsFactors=FALSE,
id = c(123, 234, 345, 456),
string = c("[\"aaa\"] / [\"aaa\"] / [\"aaa\"] / bbb / bbb / bbb",
"[\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"]",
"my name is tim / my name is tim / my name is tim", "[\"hello word\"]")
)
It looks like this:
id string
1 123 ["aaa"] / ["aaa"] / ["aaa"] / bbb / bbb / bbb
2 234 ["hello hello"] / ["hello hello"] / ["hello hello"] / ["hello hello"]
3 345 my name is tim / my name is tim / my name is tim
4 456 ["hello word"]
The pattern I see is that each group of repeated elements is separated by the symbol /:
["aaa"] / ["aaa"] / ["aaa"] / bbb / bbb / bbb
Or:
my name is tim / my name is tim / my name is tim
But also cases with a single element:
["hello word"]
I would like to have a dataframe like the following:
df_output <- data.frame(stringsAsFactors=FALSE,
id = c(123, 234, 345, 456),
string = c("[\"aaa\"] / bbb", "[\"hello hello\"]", "my name is tim",
"[\"hello word\"]")
)
It looks like this:
id string
1 123 ["aaa"] / bbb
2 234 ["hello hello"]
3 345 my name is tim
4 456 ["hello word"]
That is, I keep only the unique elements; if multiple elements are present, they are separated by /.
Is there a solution in dplyr?
Upvotes: 2
Views: 847
Reputation: 887118
An option with cSplit and data.table:
library(splitstackshape)
unique(cSplit(df_input, "string", sep = " / ", direction = "long"))[,
  .(string = paste(string, collapse = " / ")), by = .(id)]
# id string
#1: 123 ["aaa"] / bbb
#2: 234 ["hello hello"]
#3: 345 my name is tim
#4: 456 ["hello word"]
Upvotes: 1
Reputation: 16978
You could use dplyr and tidyr:
library(dplyr)
library(tidyr)
df_input %>%
separate_rows(string, sep=" / ") %>%
distinct() %>%
group_by(id) %>%
summarise(string = paste(string, collapse=" / "), .groups="drop") %>%
as.data.frame()
returns
id string
1 123 ["aaa"] / bbb
2 234 ["hello hello"]
3 345 my name is tim
4 456 ["hello word"]
The group_by, summarise and as.data.frame part can be skipped with
aggregate(string ~ id, ., paste, collapse=" / ")
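Put together, the shortened pipeline might look like the sketch below. It repeats df_input from the question so the snippet runs on its own; the name result is only illustrative. Because the dot is passed explicitly as the data argument, the pipe does not also insert the data frame as the first argument.

```r
library(dplyr)
library(tidyr)

# Input data copied from the question
df_input <- data.frame(stringsAsFactors = FALSE,
  id = c(123, 234, 345, 456),
  string = c("[\"aaa\"] / [\"aaa\"] / [\"aaa\"] / bbb / bbb / bbb",
             "[\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"]",
             "my name is tim / my name is tim / my name is tim",
             "[\"hello word\"]")
)

result <- df_input %>%
  separate_rows(string, sep = " / ") %>%              # one row per element
  distinct() %>%                                      # drop repeated (id, element) pairs
  aggregate(string ~ id, ., paste, collapse = " / ")  # glue the elements back together per id

print(result)
```

Note that aggregate returns a plain data.frame with its rows ordered by id.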
Upvotes: 2
Reputation: 1428
Using stringr's str_split and purrr's map to find the unique elements and recombine them:
library(stringr)
library(purrr)
library(dplyr)
df_input %>%
mutate(string = string %>%
str_split(" / ") %>%
map(unique) %>%
map_chr(paste, collapse = " / "))
#> id string
#> 1 123 ["aaa"] / bbb
#> 2 234 ["hello hello"]
#> 3 345 my name is tim
#> 4 456 ["hello word"]
Created on 2020-07-02 by the reprex package (v0.3.0)
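For comparison, the same split, deduplicate and recombine steps can be sketched in base R without any packages. The helper name dedupe_elements and the variable df_output are hypothetical, chosen only for this illustration, and df_input is repeated from the question so the snippet is self-contained.

```r
# Input data copied from the question
df_input <- data.frame(stringsAsFactors = FALSE,
  id = c(123, 234, 345, 456),
  string = c("[\"aaa\"] / [\"aaa\"] / [\"aaa\"] / bbb / bbb / bbb",
             "[\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"]",
             "my name is tim / my name is tim / my name is tim",
             "[\"hello word\"]")
)

# Hypothetical helper: split each string on the separator,
# keep the first occurrence of every element, and paste them back.
dedupe_elements <- function(x, sep = " / ") {
  parts <- strsplit(x, sep, fixed = TRUE)   # list with one character vector per row
  vapply(parts,
         function(p) paste(unique(p), collapse = sep),
         character(1))
}

df_output <- df_input
df_output$string <- dedupe_elements(df_input$string)

print(df_output)
```

Using fixed = TRUE treats the separator as a literal string rather than a regular expression, which matters if the separator ever contains regex metacharacters.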
Upvotes: 2