Reputation: 357
To generate all possible word combinations from a set of input sentences, I wrote the code below.
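library(stringr)
library(splitstackshape)
text = c('I like you', 'I love you so much', 'she like it so much', 'she hate you', 'he hate you so much', 'I like him')
tex = data.frame(text)
# Split each sentence into one column per word position
pattern = data.frame(cSplit(tex, "text", " "))
n = ncol(pattern)
# Collect the unique words at each position as one space-separated string
dat = c()
for (i in 1:n) {
  tt = unique(pattern[, i])
  g = paste0(tt, collapse = ' ')
  dat = c(dat, g)
}
SEQ = data.frame(dat)
# Split the strings again: one row per position, one column per unique word
SEQ = data.frame(cSplit(SEQ, "dat", " "))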
This produces the following data frame:
  dat_1 dat_2 dat_3
1     I   she    he
2  like  love  hate
3   you    it   him
4  <NA>    so  <NA>
5  <NA>  much  <NA>
What I want is all 108 possible combinations of these words (3 × 3 × 3 × 2 × 2, counting NA as an option in the last two positions), like this:
I like you so NA
I like you so much
I like you NA NA
I like you NA much
...
he love him so much
he love him NA NA
he love him NA much
he hate you so NA
he hate you so much
...
How can I generate these combinations?
Upvotes: 1
Views: 84
Reputation: 67778
I think data.table::tstrsplit is convenient for splitting and transposing. Then select the unique values of each list element (lapply(x, unique)) and make all combinations (expand.grid):
expand.grid(lapply(data.table::tstrsplit(text, split = " "), unique))
#     Var1 Var2 Var3 Var4 Var5
# 1      I like  you <NA> <NA>
# 2    she like  you <NA> <NA>
# 3     he like  you <NA> <NA>
# 4      I love  you <NA> <NA>
# 5    she love  you <NA> <NA>
# [snip]
# 104  she love  him   so much
# 105   he love  him   so much
# 106    I hate  him   so much
# 107  she hate  him   so much
# 108   he hate  him   so much
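For readers without data.table, the same three steps (split, collect by position, cross the unique values) can be sketched in base R only. This is a minimal sketch, not the method from the answer above: the names words, n_pos, by_position, choices and combos are made up for illustration, and the padding with NA is done by hand instead of by tstrsplit.

```r
text <- c('I like you', 'I love you so much', 'she like it so much',
          'she hate you', 'he hate you so much', 'I like him')

# 1. Split each sentence into its words.
words <- strsplit(text, " ", fixed = TRUE)

# 2. Collect the words by position; indexing past the end of a shorter
#    sentence yields NA, which pads it to the longest sentence.
n_pos <- max(lengths(words))
by_position <- lapply(seq_len(n_pos), function(i) {
  vapply(words, function(w) w[i], character(1))
})

# 3. Keep the unique values per position (NA counts as a value)
#    and build every combination of them.
choices <- lapply(by_position, unique)
combos <- expand.grid(choices, stringsAsFactors = FALSE)

nrow(combos)
```

Because NA is kept as one of the choices in positions 4 and 5, the row count is the product of the number of choices per position.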
You may also use the data.table equivalent of expand.grid, CJ, which has a unique argument.
library(data.table)
do.call(CJ, c(tstrsplit(text, split = " "), unique = TRUE))
#       V1   V2  V3   V4   V5
#   1:   I hate him <NA> <NA>
#   2:   I hate him <NA> much
#   3:   I hate him   so <NA>
#   4:   I hate him   so much
#   5:   I hate  it <NA> <NA>
#  ---
# 104: she love  it   so much
# 105: she love you <NA> <NA>
# 106: she love you <NA> much
# 107: she love you   so <NA>
# 108: she love you   so much
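To see what the unique argument changes, here is a small sketch on toy vectors (x and y are invented for illustration, not taken from the question). Without unique = TRUE, CJ crosses the vectors as given, so repeated values produce repeated rows. The do.call form above is needed only because the vectors come as a list: c(list, unique = TRUE) appends the argument as a named list element, which do.call then passes to CJ.

```r
library(data.table)

x <- c("b", "a", "b")   # "b" appears twice
y <- c(2, 1)

all_rows  <- CJ(x, y)                 # duplicates kept: 3 * 2 rows
uniq_rows <- CJ(x, y, unique = TRUE)  # duplicates dropped first: 2 * 2 rows

# The same call built from a list, as in the answer above.
args <- list(x, y)
from_list <- do.call(CJ, c(args, unique = TRUE))

nrow(all_rows)
nrow(uniq_rows)
nrow(from_list)
```

Note that CJ sorts its result by default (sorted = TRUE), which is why the row order above differs from the expand.grid output.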
Upvotes: 2
Reputation: 887098
From the "pattern" dataset, we can also use expand from tidyr:
library(tidyr)
expand(pattern, !!! rlang::syms(names(pattern)))
Or we can use separate with expand:
library(tidyverse)
mx <- max(str_count(tex$text, "\\w+"))
tex %>%
  separate(text, into = paste0("dat_", seq_len(mx))) %>%
  expand(!!! rlang::syms(names(.)))
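The !!! rlang::syms(names(...)) idiom in both snippets simply passes every column to expand without typing the names out. A small sketch on a toy data frame (df, a and b are made up for illustration) compares it with the spelled-out call:

```r
library(tidyr)

df <- data.frame(a = c("x", "y", "x"), b = c(1, 2, 2))

# Columns named explicitly.
spelled_out <- expand(df, a, b)

# Column names turned into symbols and spliced in as separate arguments.
spliced <- expand(df, !!!rlang::syms(names(df)))

nrow(spelled_out)
nrow(spliced)
```

Both calls cross the unique values of a with the unique values of b, so they should return the same number of rows.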
Upvotes: 2