Ablum89

Reputation: 35

How to match strings in different combinations in R

I have a data frame df whose values are words separated by +, but I don't want the order of the words to matter when I perform my analysis. For instance, I have

df <- as.data.frame(
      c(("Yellow + Blue + Green"),
        ("Blue + Yellow + Green"),
        ("Green + Yellow + Blue")))

Currently, they are being treated as three unique responses, but I want them to be considered the same. I have tried brute-force methods such as ifelse statements, but they don't scale well to large datasets.

Is there a way to rearrange the terms so they match or something like a reverse combn function that recognizes they are the same combination but in a different order?

Thanks!

Upvotes: 3

Views: 238

Answers (2)

CPak

Reputation: 13581

I wanted to provide my take on this, since it's not clear what format you want your output in.

I use the packages stringr and iterators, with the df created by d.b.

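library(stringr)    # str_extract_all(), boundary()
library(iterators)  # iter()
search <- c("Yellow", "Green", "Blue")
# Split each row into its words
L <- str_extract_all(df$cols, boundary("word"))
# TRUE where a row contains every search term, in any order
sapply(iter(L), function(x) all(search %in% x))
# [1]  TRUE  TRUE  TRUE FALSE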
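
One thing to watch: %in% tests containment, so a row with extra words would still count as a match. If an exact, order-independent match is what's wanted, base R's setequal is one option. Below is a rough sketch using only base R. The fifth row and the names contains_all and matches_exactly are made up for illustration, and it assumes the separator is always " + ".

```r
# Same four rows as d.b's data, plus one hypothetical row with an extra word
df <- data.frame(cols = c("Yellow + Blue + Green",
                          "Blue + Yellow + Green",
                          "Green + Yellow + Blue",
                          "Green + Yellow + Red",
                          "Yellow + Blue + Green + Red"),
                 stringsAsFactors = FALSE)

search <- c("Yellow", "Green", "Blue")

# Split each row on the literal " + " separator
words <- strsplit(df$cols, " + ", fixed = TRUE)

# Containment: every search term appears in the row (extra words allowed)
contains_all <- vapply(words, function(w) all(search %in% w), logical(1))

# Exact match: the row has the same set of words as search, in any order
matches_exactly <- vapply(words, function(w) setequal(w, search), logical(1))
```

Here contains_all flags the fifth row as a match while matches_exactly does not, so which one to use depends on whether supersets should count.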

Upvotes: 0

d.b

Reputation: 32548

#DATA
df <- data.frame(cols = 
                 c(("Yellow + Blue + Green"),
                   ("Blue + Yellow + Green"),
                   ("Green + Yellow + Blue"),
                   ("Green + Yellow + Red")), stringsAsFactors = FALSE)

#Split, sort, and then paste together
df$group = sapply(df$cols, function(a)
    paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))
df
#                   cols               group
#1 Yellow + Blue + Green Blue, Green, Yellow
#2 Blue + Yellow + Green Blue, Green, Yellow
#3 Green + Yellow + Blue Blue, Green, Yellow
#4  Green + Yellow + Red  Green, Red, Yellow

#Or you can convert to factors too (and back to numeric, if you like)
df$group2 = as.numeric(as.factor(sapply(df$cols, function(a)
        paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", "))))
df
#                   cols               group group2
#1 Yellow + Blue + Green Blue, Green, Yellow      1
#2 Blue + Yellow + Green Blue, Green, Yellow      1
#3 Green + Yellow + Blue Blue, Green, Yellow      1
#4  Green + Yellow + Red  Green, Red, Yellow      2
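
If the same normalisation is needed in more than one place, the split, sort, and paste step could be wrapped in a small helper. This is only a sketch: canonical_key is a hypothetical name (not a function from base R or any package), and it assumes the terms are separated by a literal + with optional whitespace around it.

```r
# Hypothetical helper: builds an order-independent key for each string by
# splitting on the separator, trimming whitespace, sorting, and re-joining.
canonical_key <- function(x, sep = "+") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(sort(trimws(p)), collapse = " + "),
         character(1))
}

df <- data.frame(cols = c("Yellow + Blue + Green",
                          "Blue + Yellow + Green",
                          "Green + Yellow + Blue",
                          "Green + Yellow + Red"),
                 stringsAsFactors = FALSE)

df$group <- canonical_key(df$cols)

# Count how many rows fall into each combination
table(df$group)
```

Once the key column exists, rows with the same words in any order share a value, so the usual grouping tools (table, aggregate, split, and so on) treat them as one response.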

Upvotes: 6
