Reputation: 1271
I'm working in R, using tidyverse
and dplyr
functions to generate new columns, but I'm running into a wall when trying to find unique values within a string column. Here's a detailed description of the problem.
Suppose I have a tibble called df
, with a chr
column called col1
that contains strings. These strings are, in reality, a list of values that are separated by a comma (", "
). Here's what df
looks like:
library(tidyverse)
library(dplyr)
df = data.frame(id=c(1,2,3,4,5),
col1=c("a, b, x, a","b, b","c, b, b, b", "b, x, b, c", "c")) %>%
as_tibble()
print(df)
# # A tibble: 5 x 2
# id col1
# <dbl> <chr>
# 1 1 a, b, x, a
# 2 2 b, b
# 3 3 c, b, b, b
# 4 4 b, x, b, c
# 5 5 c
I want to separate the values in col1
wherever we find ", "
, remove any duplicate values, sort the unique values, remove the "x"
values, then concatenate them back together into a string using ", "
as a delimiter between the multiple unique items.
In more practical terms, I would like to create a column as seen below in col2
:
# # A tibble: 5 x 3
# id col1 col2
# <dbl> <chr> <chr>
# 1 1 a, b, x, a a, b
# 2 2 b, b b
# 3 3 c, b, b, b b, c
# 4 4 b, x, b, c b, c
# 5 5 c c
If I just have a string variable, I know that I can do all of the processing in a couple of steps:
x = "b, x, b, c"
x_temp = unique(strsplit(x, ", ")[[1]])
x_simp = paste(sort(x_temp[x_temp != "x"]), collapse=", ")
print(x_simp)
# [1] "b, c"
However, I'm having a hard time translating this process back into the mutate
function:
newdf = df %>%
mutate(col2 = paste(sort(unique(strsplit(col1, ", ")[[1]])[unique(strsplit(col1, ", ")[[1]]) != "x"]), collapse=", "))
# A tibble: 5 x 3
# id col1 col2
# <dbl> <chr> <chr>
# 1 1 a, b, x, a a, b
# 2 2 b, b a, b
# 3 3 c, b, b, b a, b
# 4 4 b, x, b, c a, b
# 5 5 c a, b
How can I use tidyverse
/dplyr
functions to generate a new column that is the result of the following processing steps on one of a tibble
/df
's columns:
Upvotes: 1
Views: 65
Reputation: 1271
I'm not sure how efficient this is, but I just figured out I can use the mapply
function to apply a custom-built function on all rows of the input tibble as follows:
myfunc = function(in_str){
temp = unique(strsplit(in_str, ", ")[[1]])
simp = paste(sort(temp[temp != "x"]), collapse=", ")
return(simp)
}
newdf2 = df %>%
mutate(col2 = mapply(myfunc, col1))
print(newdf2)
# # A tibble: 5 x 3
# id col1 col2
# <dbl> <chr> <chr>
# 1 1 a, b, x, a a, b
# 2 2 b, b b
# 3 3 c, b, b, b b, c
# 4 4 b, x, b, c b, c
# 5 5 c c
Upvotes: 1
Reputation: 886938
We may do this in tidyverse
, by splitting with separate_rows
, and then do a group by paste
after removing the duplicates
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(col2 = col1) %>%
separate_rows(col2) %>%
distinct(across(everything())) %>%
group_by(id, col1) %>%
summarise(col2 = str_c(sort(col2[col2 != "x"]), collapse = ", "),
.groups = 'drop')
-output
# A tibble: 5 × 3
id col1 col2
<dbl> <chr> <chr>
1 1 a, b, x, a a, b
2 2 b, b b
3 3 c, b, b, b b, c
4 4 b, x, b, c b, c
5 5 c c
Upvotes: 2