Felipe D.
Felipe D.

Reputation: 1271

Creating new column by splitting a `chr` column, finding unique values, sorting them, removing certain values, and combining them back into one string

I'm working in R, using tidyverse and dplyr functions to generate new columns, but I'm running into a wall when trying to find unique values within a string column. Here's a detailed description of the problem.

Setup

Suppose I have a tibble called df, with a chr column called col1 that contains strings. These strings are, in reality, a list of values that are separated by a comma (", "). Here's what df looks like:

library(tidyverse)
library(dplyr)

df = data.frame(id=c(1,2,3,4,5),
                col1=c("a, b, x, a","b, b","c, b, b, b", "b, x, b, c", "c")) %>%
  as_tibble()

print(df)

# # A tibble: 5 x 2    
#      id col1         
#   <dbl> <chr>        
# 1     1 a, b, x, a      
# 2     2 b, b         
# 3     3 c, b, b, b
# 4     4 b, x, b, c      
# 5     5 c            

The problem

I want to separate the values in col1 wherever we find ", ", remove any duplicate values, sort the unique values, remove the "x" values, then concatenate them back together into a string using ", " as a delimiter between the multiple unique items.

In more practical terms, I would like to create a column as seen below in col2:

# # A tibble: 5 x 3            
#      id col1          col2   
#   <dbl> <chr>         <chr>  
# 1     1 a, b, x, a    a, b   
# 2     2 b, b          b      
# 3     3 c, b, b, b    b, c   
# 4     4 b, x, b, c    b, c
# 5     5 c             c      

My attempt so far

If I just have a string variable, I know that I can do all of the processing in a couple of steps:

x = "b, x, b, c"
x_temp = unique(strsplit(x, ", ")[[1]])
x_simp = paste(sort(x_temp[x_temp != "x"]), collapse=", ")
print(x_simp)
# [1] "b, c"

However, I'm having a hard time translating this process back into the mutate function:

newdf = df %>% 
  mutate(col2 = paste(sort(unique(strsplit(col1, ", ")[[1]])[unique(strsplit(col1, ", ")[[1]]) != "x"]), collapse=", "))

# A tibble: 5 x 3
#    id col1              col2 
# <dbl> <chr>             <chr>
#   1   1 a, b, x, a      a, b 
#   2   2 b, b            a, b 
#   3   3 c, b, b, b      a, b 
#   4   4 b, x, b, c      a, b 
#   5   5 c               a, b 

Summary

How can I use tidyverse/dplyr functions to generate a new column that is the result of the following processing steps on one of a tibble/df's columns:

Upvotes: 1

Views: 65

Answers (2)

Felipe D.
Felipe D.

Reputation: 1271

I'm not sure how efficient this is, but I just figured out I can use the mapply function to apply a custom-built function on all rows of the input tibble as follows:

myfunc = function(in_str){
  temp = unique(strsplit(in_str, ", ")[[1]])
  simp = paste(sort(temp[temp != "x"]), collapse=", ")
  return(simp)
}

newdf2 = df %>% 
  mutate(col2 = mapply(myfunc, col1))

print(newdf2)
# # A tibble: 5 x 3
#      id    col1          col2 
#   <dbl>    <chr>         <chr>
# 1     1    a, b, x, a    a, b 
# 2     2    b, b          b    
# 3     3    c, b, b, b    b, c 
# 4     4    b, x, b, c    b, c 
# 5     5    c             c    

Upvotes: 1

akrun
akrun

Reputation: 886938

We may do this in tidyverse, by splitting with separate_rows, and then do a group by paste after removing the duplicates

library(dplyr)
library(tidyr)
library(stringr)
df %>%
   mutate(col2 = col1) %>% 
   separate_rows(col2) %>%
   distinct(across(everything())) %>% 
   group_by(id, col1) %>% 
   summarise(col2 = str_c(sort(col2[col2 != "x"]), collapse = ", "),
       .groups = 'drop')

-output

# A tibble: 5 × 3
     id col1       col2 
  <dbl> <chr>      <chr>
1     1 a, b, x, a a, b 
2     2 b, b       b    
3     3 c, b, b, b b, c 
4     4 b, x, b, c b, c 
5     5 c          c    

Upvotes: 2

Related Questions