Ian

Reputation: 79

Combining two words to produce all possible character combinations

I have pairs of words that are transcribed in ARPABET. I am trying to combine these words such that every possible segment sequence, assuming strict ordering, is produced. An example would look like:

word1   transcription1   word2   transcription2
dog     D AA G           cat     K AE T  

Combining transcription1 and transcription2 would produce something like the list below, stepping through the words one segment at a time. For this toy example I have not included cases where no segment from the second word is used (i.e., dog + cat = dog), though they are probably in the logical space.

D K AE T 
D AE T 
D T 

D AA K AE T 
D AA AE T 
D AA T 

D AA G K AE T 
D AA G AE T
D AA G T 
D AA G

K D AA G 
K AA G
K G

K AE D AA G
K AE AA G 
K AE G 

K AE T D AA G 
K AE T AA G
K AE T G 

The eventual goal is to do some quantitative analysis on each of these outputs, so saving them to a large data frame would be ideal, although it might become unwieldy with the amount of data I am working with (~900 pairs of words, 3-7 segments each). Any help on this problem would be great.

Upvotes: 2

Views: 102

Answers (2)

Darren Tsai

Reputation: 35554

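Here is a handmade function that uses only base R functions. It returns every order-preserving subsequence of the two transcriptions concatenated, including the empty string: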

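fun <- function(x, y){
  # Split each transcription into its segments
  x <- strsplit(x, " ")[[1]]
  y <- strsplit(y, " ")[[1]]
  # Pair every segment with NA (meaning "drop it"), then enumerate all
  # keep/drop choices with expand.grid
  grid <- do.call(expand.grid, lapply(c(x, y), c, NA))
  # Paste the kept segments of each row back together
  apply(grid, 1, function(row) paste(row[!is.na(row)], collapse = " "))
}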

fun("D AA G", "K AE T")

#  [1] "D AA G K AE T" "AA G K AE T"   "D G K AE T"    "G K AE T"     
#  [5] "D AA K AE T"   "AA K AE T"     "D K AE T"      "K AE T"       
#  [9] "D AA G AE T"   "AA G AE T"     "D G AE T"      "G AE T"       
# [13] "D AA AE T"     "AA AE T"       "D AE T"        "AE T"         
# [17] "D AA G K T"    "AA G K T"      "D G K T"       "G K T"        
# [21] "D AA K T"      "AA K T"        "D K T"         "K T"          
# [25] "D AA G T"      "AA G T"        "D G T"         "G T"          
# [29] "D AA T"        "AA T"          "D T"           "T"            
# [33] "D AA G K AE"   "AA G K AE"     "D G K AE"      "G K AE"       
# [37] "D AA K AE"     "AA K AE"       "D K AE"        "K AE"         
# [41] "D AA G AE"     "AA G AE"       "D G AE"        "G AE"         
# [45] "D AA AE"       "AA AE"         "D AE"          "AE"           
# [49] "D AA G K"      "AA G K"        "D G K"         "G K"          
# [53] "D AA K"        "AA K"          "D K"           "K"            
# [57] "D AA G"        "AA G"          "D G"           "G"            
# [61] "D AA"          "AA"            "D"             ""   
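
The output above covers every order-preserving subsequence of the six segments, which is a superset of the pattern listed in the question. If only the question's pattern is wanted (an initial stretch of one word followed by a final stretch of the other), a minimal sketch might look like the following. The function name `blend` and its argument names are made up for illustration, and the sketch assumes at least one segment is kept from each word.

```r
# Hypothetical helper: every non-empty prefix of `first` followed by
# every non-empty suffix of `second` (segments are space-separated).
blend <- function(first, second) {
  a <- strsplit(first, " ")[[1]]
  b <- strsplit(second, " ")[[1]]
  # One row per (prefix length i, suffix start j); j varies fastest
  grid <- expand.grid(j = seq_along(b), i = seq_along(a))
  mapply(function(i, j) paste(c(a[seq_len(i)], b[j:length(b)]), collapse = " "),
         grid$i, grid$j)
}

blend("D AA G", "K AE T")  # dog-initial combinations
blend("K AE T", "D AA G")  # cat-initial combinations
```

With three segments per word, each call returns 9 strings, in the same order as the corresponding groups in the question ("D K AE T", "D AE T", "D T", and so on).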

Upvotes: 2

thc

Reputation: 9705

Here's a simple function to do so:

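library(dplyr)  # used here only for the %>% pipe

segment_sequences <- function(x, y) {
  # Split each transcription into segments and concatenate them
  x <- strsplit(x, " ") %>% unlist
  y <- strsplit(y, " ") %>% unlist
  z <- c(x, y)
  # For each length j, take every order-preserving choice of j segments
  lapply(seq_along(z), function(j) {
    combos <- combn(seq_along(z), j, simplify = FALSE)
    sapply(combos, function(cb) paste0(z[cb], collapse = " "))
  }) %>% unlist
}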

segment_sequences("D AA G","K AE T")

[1] "D"             "AA"            "G"             "K"             "AE"            "T"             "D AA"          "D G"           "D K"           "D AE"          "D T"           "AA G"          "AA K"          "AA AE"         "AA T"          "G K"           "G AE"         
[18] "G T"           "K AE"          "K T"           "AE T"          "D AA G"        "D AA K"        "D AA AE"       "D AA T"        "D G K"         "D G AE"        "D G T"         "D K AE"        "D K T"         "D AE T"        "AA G K"        "AA G AE"       "AA G T"       
[35] "AA K AE"       "AA K T"        "AA AE T"       "G K AE"        "G K T"         "G AE T"        "K AE T"        "D AA G K"      "D AA G AE"     "D AA G T"      "D AA K AE"     "D AA K T"      "D AA AE T"     "D G K AE"      "D G K T"       "D G AE T"      "D K AE T"     
[52] "AA G K AE"     "AA G K T"      "AA G AE T"     "AA K AE T"     "G K AE T"      "D AA G K AE"   "D AA G K T"    "D AA G AE T"   "D AA K AE T"   "D G K AE T"    "AA G K AE T"   "D AA G K AE T"
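
Since the goal is a data frame covering many word pairs, one way to scale this up is to build a small data frame per pair and stack them. The sketch below is only an illustration: the `pairs` data frame is a hypothetical input using the column names from the question, and `all_subsequences` is a base-R stand-in written for this example so that the block runs on its own.

```r
# Hypothetical input: one row per word pair, named as in the question
pairs <- data.frame(
  word1 = c("dog", "cat"), transcription1 = c("D AA G", "K AE T"),
  word2 = c("cat", "dog"), transcription2 = c("K AE T", "D AA G"),
  stringsAsFactors = FALSE
)

# Stand-in combining function: all order-preserving, non-empty
# subsequences of the two transcriptions concatenated
all_subsequences <- function(x, y) {
  z <- c(strsplit(x, " ")[[1]], strsplit(y, " ")[[1]])
  unlist(lapply(seq_along(z), function(j) {
    combn(length(z), j, FUN = function(cb) paste(z[cb], collapse = " "))
  }))
}

# One small data frame per pair, stacked into a single long data frame
long <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
  data.frame(word1 = pairs$word1[k], word2 = pairs$word2[k],
             combination = all_subsequences(pairs$transcription1[k],
                                            pairs$transcription2[k]),
             stringsAsFactors = FALSE)
}))

nrow(long)  # 126: 63 combinations for each of the two pairs
```

A pair with n segments in total yields 2^n - 1 non-empty combinations, so two 7-segment words give 16383 rows on their own. With roughly 900 pairs it may be worth filtering inside the loop, or writing each pair's result to disk, rather than holding everything in one data frame.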

Upvotes: 3
