
Extract unique string elements from dataframe column in R

I have a dataframe with two columns:

id: an identifier

string: a text column containing repeated elements separated by the symbol /

library(tidyverse)

df_input <- data.frame(stringsAsFactors=FALSE,
                    id = c(123, 234, 345, 456),
                string = c("[\"aaa\"] / [\"aaa\"] / [\"aaa\"] / bbb / bbb / bbb",
                           "[\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"]",
                           "my name is tim / my name is tim / my name is tim", "[\"hello word\"]")
          )

Looking like:

       id                                                                string
     1 123                         ["aaa"] / ["aaa"] / ["aaa"] / bbb / bbb / bbb
     2 234 ["hello hello"] / ["hello hello"] / ["hello hello"] / ["hello hello"]
     3 345                      my name is tim / my name is tim / my name is tim
     4 456                                                        ["hello word"]

The pattern I see is that the repeated elements within each string are separated by the symbol /:

["aaa"] / ["aaa"] / ["aaa"] / bbb / bbb / bbb

Or:

my name is tim / my name is tim / my name is tim

There are also cases with a single element:

["hello word"]

I would like to have a dataframe like the following:

df_output <- data.frame(stringsAsFactors=FALSE,
                       id = c(123, 234, 345, 456),
                   string = c("[\"aaa\"] / bbb", "[\"hello hello\"]", "my name is tim",
                              "[\"hello word\"]")
             )

Which looks like:

   id          string
1 123   ["aaa"] / bbb
2 234 ["hello hello"]
3 345  my name is tim
4 456  ["hello word"]

That is, I want to keep only the unique elements of each string; if more than one unique element remains, they should be separated by /.

Is there a solution using dplyr?

Upvotes: 2

Answers (3)

akrun

Reputation: 887118

An option with cSplit from splitstackshape, which returns a data.table:

library(splitstackshape)
unique(cSplit(df_input, "string", sep = " / ", direction = "long"))[,
       .(string = paste(string, collapse = " / ")), by = .(id)]
#   id          string
#1: 123   ["aaa"] / bbb
#2: 234 ["hello hello"]
#3: 345  my name is tim
#4: 456  ["hello word"]

Upvotes: 1

Martin Gal

Reputation: 16978

You could use dplyr and tidyr:

library(dplyr)
library(tidyr)

df_input %>%
  separate_rows(string, sep=" / ") %>%
  distinct() %>%
  group_by(id) %>%
  summarise(string = paste(string, collapse=" / "), .groups="drop") %>%
  as.data.frame()

returns

   id          string
1 123   ["aaa"] / bbb
2 234 ["hello hello"]
3 345  my name is tim
4 456  ["hello word"]

The group_by, summarise and as.data.frame steps can be replaced with a single call to aggregate at the end of the pipe:

aggregate(string ~ id, ., paste, collapse=" / ")

Upvotes: 2

RyanFrost

Reputation: 1428

Using stringr's str_split to split each string, then purrr to map unique over the pieces and paste them back together:

library(stringr)
library(purrr)
library(dplyr)

df_input %>%
  mutate(string = string %>% 
           str_split(" / ") %>%
           map(unique) %>%
           map_chr(paste, collapse = " / "))
#>    id          string
#> 1 123   ["aaa"] / bbb
#> 2 234 ["hello hello"]
#> 3 345  my name is tim
#> 4 456  ["hello word"]

Created on 2020-07-02 by the reprex package (v0.3.0)
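For comparison, the same split / unique / paste idea can be sketched in base R, without stringr or purrr. This is only a minimal sketch, assuming the separator is always exactly " / " (space, slash, space) as in the question's data. The names parts and df_output are illustrative and do not come from the answer above.

```r
# Sample data copied from the question
df_input <- data.frame(
  stringsAsFactors = FALSE,
  id = c(123, 234, 345, 456),
  string = c(
    "[\"aaa\"] / [\"aaa\"] / [\"aaa\"] / bbb / bbb / bbb",
    "[\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"] / [\"hello hello\"]",
    "my name is tim / my name is tim / my name is tim",
    "[\"hello word\"]"
  )
)

# Split each string on the literal separator; fixed = TRUE avoids regex parsing
parts <- strsplit(df_input$string, " / ", fixed = TRUE)

# Keep the unique pieces of each row (in order of first appearance)
# and paste them back together with the same separator
df_output <- df_input
df_output$string <- vapply(
  parts,
  function(x) paste(unique(x), collapse = " / "),
  character(1)
)

print(df_output)
```

Because unique keeps the first occurrence of each element, the pieces stay in their original order, and a row with a single element passes through unchanged.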

Upvotes: 2
