WJH

Reputation: 581

How to do multiple sequence alignment of text strings (utf8) in R

Given five strings:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

I would like to do multiple sequence alignment so that I get the following result:

abcd
 bcde
  cdef
a    f
  cd  ghi

Using the msa() function from the msa package I tried:

library(msa)
msa(seq, type = "protein", order = "input", method = "Muscle")

and got the following result:

    aln     names
 [1] ABCD--- Seq1
 [2] -BCDE-- Seq2
 [3] --CD-EF Seq3
 [4] -----AF Seq4
 [5] --CDGHI Seq5
 Con --CD-?? Consensus   

I would like to use this function for sequences that can contain any Unicode characters, but even for this simple example it gives the warning "invalid letters found". Any ideas?

Upvotes: 2

Views: 646

Answers (2)

WJH

Reputation: 581

One solution is to use LingPy, a Python library, called from R through the reticulate package. First install LingPy following the instructions at http://lingpy.org/tutorial/installation.html, then run:

library(reticulate)

builtins <- import_builtins()
lingpy   <- import("lingpy")

seqs <- c("mɪlk","mɔˑlkə","mɛˑlək","mɪlɪx","mɑˑlʲk")

multi <- lingpy$Multiple(seqs)
multi$prog_align()
builtins$print(multi)

Output:

m   ɪ   l   -   k   -
m   ɔˑ  l   -   k   ə
m   ɛˑ  l   ə   k   -
m   ɪ   l   ɪ   x   -
m   ɑˑ  lʲ  -   k   -

Upvotes: 1

Allan Cameron

Reputation: 174293

Here's a solution in base R that outputs a table:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

all_chars <- unique(unlist(strsplit(seq, "")))

tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""), 
       function(x) table(factor(x, all_chars)))), 1,
       function(x) ifelse(x == 1, all_chars, " ")))

We can print the output without quotes to see it more clearly:

print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d          
#> [2,]   b c d e        
#> [3,]     c d e f      
#> [4,] a         f      
#> [5,]     c d     g h i

Created on 2022-05-25 by the reprex package (v2.0.1)
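
If the goal is the space-padded strings shown in the question rather than a table, the rows of tab can be pasted back together. The following is only a sketch built on the code above: the name aligned and the trimming of trailing spaces are my additions, not part of the original answer. It also inherits that approach's assumptions, namely that each character occurs at most once per string (a repeated character fails the x == 1 test and is blanked) and that characters appear in a consistent order across strings.

```r
seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

# Column order is the order in which characters first appear
all_chars <- unique(unlist(strsplit(seq, "")))

# Presence table from the answer above: one row per string,
# one column per distinct character
tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""),
       function(x) table(factor(x, all_chars)))), 1,
       function(x) ifelse(x == 1, all_chars, " ")))

# Collapse each row into a single string and drop trailing padding
# ('aligned' is a name introduced for this sketch)
aligned <- trimws(apply(tab, 1, paste, collapse = ""), which = "right")

writeLines(aligned)
```

Because the columns are distinct characters rather than alignment positions, this reproduces the desired layout for this input, but it is not a general alignment method.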

Upvotes: 3
