Align many short sequences from two lists and find complementary

Question

So I have two list objects in R and I want to know which of the sequences can bind to each other through DNA complementarity.

The first object, rs, holds the reverse complements of microRNA seed regions, and the second, u, holds 3'UTR motifs.
I found a package called microRNA (https://www.bioconductor.org/packages/release/bioc/manuals/microRNA/man/microRNA.pdf) with a function called matchSeeds(seed, seq). I tried it, but it only looks for exact matches, which is not quite what I need. Any lead on how to solve this in R would be very much appreciated. Thanks!

> typeof(rs)
[1] "list"
> typeof(u)
[1] "list"

head(rs)
$`miR-92|34108_3p `
[1] "TGCAAT"

$`miR-92|34106_3p `
[1] "TGCAAT"

$`miR-92|34110_3p `
[1] "TGCAAT"

$`miR-184|1952_3p `
[1] "CCGTCC"

$`miR-184|1954_3p `
[1] "CCGTCC"

$`miR-1795_3p `
[1] "CCGTCC"

head(u)
$upper_1
[1] "gccgtt"

$upper_2
[1] "ccgagc"

$upper_3
[1] "gacatt"

$upper_4
[1] "gcttat"

$upper_5
[1] "taccta"

$upper_6
[1] "tcgtct"

missuse · Accepted Answer

If you want to find out whether any of the strings in the rs list are complementary to the strings in the u list, and you want it to be fast, you can use the matchPDict function from the Biostrings package.

Example (lis stands in for the rs list from the question):

library(Biostrings)
library(IRanges)

lis <- list(`miR-92|34108_3p ` = "TGCAAT",
            `miR-92|34106_3p ` = "TGCAAT",
            `miR-92|34110_3p ` = "TGCAAT")

u <- list(upper_1 ="gccgtt",
          upper_2 = "ccgagc",
          upper_3 = "gacatt",
          upper_4 = "gcttat",
          upper_5 = "taccta")

Convert first list to DNAStringSet:

lis <- DNAStringSetList(lis)
lis <- unlist(lis)

Convert second list to DNAStringSet:

u <- DNAStringSetList(u)
u <- unlist(u)

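Get the complement of lis (complement() swaps the bases without reversing their order, so despite its name lis_rc is not the reverse complement at this point):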

lis_rc <- complement(lis)
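To make the difference between the orientations concrete, here is a minimal base-R sketch that needs no Biostrings. The helper names comp, rev_str and revcomp are my own, for illustration only. Note that the question says rs already holds reverse complements, so which orientation to match depends on how the input lists were prepared.

```r
# Illustrative helpers (hypothetical names, not part of Biostrings)
comp <- function(x) chartr("ACGTacgt", "TGCAtgca", x)

rev_str <- function(x) {
  # Reverse each string character by character
  vapply(strsplit(x, ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

revcomp <- function(x) rev_str(comp(x))

seed <- "TGCAAT"
orientations <- c(original           = seed,
                  complement         = comp(seed),
                  reverse            = rev_str(seed),
                  reverse_complement = revcomp(seed))
print(orientations)
```

Only the reverse complement pairs base by base with the original when both strands are written 5' to 3'.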

Create a PDict (a preprocessed pattern dictionary) so the patterns can be matched quickly against the other list:

pdict0 <- PDict(lis_rc)

Iterate over u, running matchPDict on each sequence:

lapply(u, function(x) matchPDict(pdict0, x))
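matchPDict as called above reports exact matches only, while the question asks for something looser. Biostrings matching functions such as matchPattern accept a max.mismatch argument, but since both lists here contain strings of equal length, a plain mismatch count (Hamming distance) may already be enough. The following is a base-R sketch under that assumption; n_mismatch is a hypothetical helper, the "toy" motif is made up to produce a near match, and the threshold of one mismatch is arbitrary.

```r
# Count positions at which two equal-length strings differ (case-insensitive)
n_mismatch <- function(a, b) {
  a <- strsplit(toupper(a), "")[[1]]
  b <- strsplit(toupper(b), "")[[1]]
  stopifnot(length(a) == length(b))
  sum(a != b)
}

seeds  <- c(seed_a = "TGCAAT", seed_b = "CCGTCC")
motifs <- c(upper_1 = "gccgtt", upper_3 = "gacatt", toy = "tgcatt")  # "toy" is hypothetical

# Rows are seeds, columns are motifs, cells are mismatch counts
mm <- sapply(motifs, function(m) sapply(seeds, n_mismatch, b = m))

# Pairs that differ at no more than one position (assumed threshold)
near <- mm <= 1
print(mm)
print(near)
```

Apply the orientation you need (for example the reverse complement) to the seeds before comparing.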

EDIT: if you want to check every orientation, you can generate them with accessory functions such as complement, reverse and reverseComplement and pass them all to PDict:

lis_rc <- c(lis,
            complement(lis),
            reverse(lis),
            reverseComplement(lis))

names(lis_rc) <- paste(trimws(names(lis_rc)),
                       rep(c("", "c", "r", "rc"), each = length(lis)),
                       sep = "_")

pdict0 <- PDict(lis_rc)

res <- lapply(u, function(x) matchPDict(pdict0, x))

res is a list with one MIndex object per sequence in u; each one holds the match ranges for every pattern in pdict0.

You can check where the hits are with:

lapply(res, width)
lapply(res, start)
lapply(res, end)
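The example data above contain no matches, so those calls return empty results. Here is a small sketch with made-up seeds (toy_seed_1 and toy_seed_2 are hypothetical, chosen so that one hit exists) showing how the number of hits per motif could be summarised. It assumes elementNROWS() on the MIndex returned by matchPDict gives the number of matches per pattern.

```r
library(Biostrings)

# Hypothetical seeds; the reverse complement of AACGGC is GCCGTT
seeds <- DNAStringSet(c(toy_seed_1 = "AACGGC", toy_seed_2 = "TGCAAT"))
utr   <- DNAStringSet(c(upper_1 = "GCCGTT", upper_2 = "CCGAGC"))

pd <- PDict(reverseComplement(seeds))

# Total number of exact hits per motif, summed over all patterns
n_hits <- sapply(names(utr), function(nm) {
  m <- matchPDict(pd, utr[[nm]])
  sum(elementNROWS(m))
})
print(n_hits)
```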

EDIT2:

If you just want to count the matches, without their coordinates, you can simply use:

vcountPDict(pdict0, u)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    0    0    0    0    0
 [2,]    0    0    0    0    0
 [3,]    0    0    0    0    0
 [4,]    0    0    0    0    0
 [5,]    0    0    0    0    0
 [6,]    0    0    0    0    0
 [7,]    0    0    0    0    0
 [8,]    0    0    0    0    0
 [9,]    0    0    0    0    0
[10,]    0    0    0    0    0
[11,]    0    0    0    0    0
[12,]    0    0    0    0    0

Rows correspond to sequences in pdict0, while columns correspond to sequences in u:

mat <- vcountPDict(pdict0, u)
rownames(mat) <- names(lis_rc)
colnames(mat) <- names(u)
mat

                   upper_1 upper_2 upper_3 upper_4 upper_5
miR-92|34108_3p_         0       0       0       0       0
miR-92|34106_3p_         0       0       0       0       0
miR-92|34110_3p_         0       0       0       0       0
miR-92|34108_3p_c        0       0       0       0       0
miR-92|34106_3p_c        0       0       0       0       0
miR-92|34110_3p_c        0       0       0       0       0
miR-92|34108_3p_r        0       0       0       0       0
miR-92|34106_3p_r        0       0       0       0       0
miR-92|34110_3p_r        0       0       0       0       0
miR-92|34108_3p_rc       0       0       0       0       0
miR-92|34106_3p_rc       0       0       0       0       0
miR-92|34110_3p_rc       0       0       0       0       0
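With real data most cells of such a matrix will be zero, so it can help to list only the pattern/motif pairs that matched. A minimal base-R sketch on a made-up count matrix (the row names, column names and the single hit are hypothetical):

```r
# Toy count matrix standing in for the labelled vcountPDict() result
mat <- matrix(0L, nrow = 3, ncol = 2,
              dimnames = list(c("seed_a_", "seed_a_rc", "seed_b_rc"),
                              c("upper_1", "upper_2")))
mat["seed_a_rc", "upper_2"] <- 1L

# Row/column indices of every non-zero cell
hits <- which(mat > 0, arr.ind = TRUE)

# One row per matching pair, with its match count
pairs <- data.frame(pattern = rownames(mat)[hits[, "row"]],
                    motif   = colnames(mat)[hits[, "col"]],
                    count   = mat[hits])
print(pairs)
```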
