Reputation: 581
Given five strings:
seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
I would like to do multiple sequence alignment so that I get the following result:
abcd
 bcde
  cdef
a    f
  cd  ghi
Using the msa() function from the msa package I tried:
msa(seq, type = "protein", order = "input", method = "Muscle")
and got the following result:
aln names
[1] ABCD--- Seq1
[2] -BCDE-- Seq2
[3] --CD-EF Seq3
[4] -----AF Seq4
[5] --CDGHI Seq5
Con --CD-?? Consensus
I would like to use this function for sequences that can contain arbitrary Unicode characters, but even in this simple example it gives the warning "invalid letters found". Any ideas?
Upvotes: 2
Views: 646
Reputation: 581
One solution is to use LingPy, a Python library, called from R through the reticulate package. First install LingPy following the instructions at http://lingpy.org/tutorial/installation.html. Then run:
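library(reticulate)
# Python built-ins, needed here only for print()
builtins <- import_builtins()
lingpy <- import("lingpy")
seqs <- c("mɪlk","mɔˑlkə","mɛˑlək","mɪlɪx","mɑˑlʲk")
# Build the multiple-alignment object and align progressively
multi <- lingpy$Multiple(seqs)
multi$prog_align()
# Printing the Python object shows the aligned segments
builtins$print(multi)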
Output:
m ɪ l - k -
m ɔˑ l - k ə
m ɛˑ l ə k -
m ɪ l ɪ x -
m ɑˑ lʲ - k -
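In the output above, LingPy treats a vowel together with its length mark (for example ɔˑ) as one segment. Splitting the same string in base R with `strsplit(x, "")` works per code point instead, so the length mark becomes an element of its own. The short check below is only a sketch to illustrate that difference; the variable names are made up for the example, and the string is written with `\u` escapes on the assumption that this avoids file-encoding problems.

```r
# "mɔˑlkə" written with escapes so the example does not depend on file encoding
word <- "m\u0254\u02d1lk\u0259"

# Base R splits on code points: the length mark U+02D1 is a separate element
chars <- strsplit(word, "")[[1]]
print(chars)
print(length(chars))

# The underlying code points, for reference
print(utf8ToInt(word))
```

This matters when choosing between a character-based approach and a segment-aware tool: the two will not necessarily agree on how many columns a phonetic transcription needs.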
Upvotes: 1
Reputation: 174293
Here's a solution in base R that outputs a table:
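seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
# One column per distinct character, in order of first appearance
all_chars <- unique(unlist(strsplit(seq, "")))
# Count each character per string, then show it wherever the count is 1
tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""),
                                     function(x) table(factor(x, all_chars)))), 1,
               function(x) ifelse(x == 1, all_chars, " ")))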
We can print the output without quotes to see it more clearly:
print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d
#> [2,]   b c d e
#> [3,]     c d e f
#> [4,] a         f
#> [5,]     c d     g h i
Created on 2022-05-25 by the reprex package (v2.0.1)
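If plain strings are preferred over a table, the same idea can be wrapped in a small function and the rows pasted back together. The following is a minimal sketch, not part of the answer above: `align_by_presence` and its `gap` argument are hypothetical names, and it tests membership with `%in%` rather than `count == 1`, so a character that occurs more than once in a string is still shown (once). It is a presence/absence layout, not a true sequence alignment, and it assumes the characters appear in a compatible order across the strings.

```r
# Hypothetical helper (not from the original answer): put each character in
# the column that belongs to it, with columns in order of first appearance.
align_by_presence <- function(seqs, gap = " ") {
  chars <- strsplit(seqs, "")            # split each string into characters
  all_chars <- unique(unlist(chars))     # one column per distinct character
  # For each string: the character where it is present, a gap elsewhere
  tab <- do.call(rbind, lapply(chars, function(x) {
    ifelse(all_chars %in% x, all_chars, gap)
  }))
  colnames(tab) <- all_chars
  tab
}

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
tab <- align_by_presence(seq)

# Collapse each row into one string and drop trailing blanks
aligned <- sub(" +$", "", apply(tab, 1, paste, collapse = ""))
writeLines(aligned)
```

Passing something like `gap = "-"` gives a layout closer to the msa output, although the trailing-blank cleanup above would then need adjusting.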
Upvotes: 3