Reputation: 773
So I have two list objects in R and I want to know which of the sequences can bind to each other through DNA complementarity.
The first object, rs, holds the reverse complements of microRNA seed regions, and the second, u, holds 3'UTR motifs.
Any lead on how to solve this problem?
I found a package called microRNA (https://www.bioconductor.org/packages/release/bioc/manuals/microRNA/man/microRNA.pdf) with a function called matchSeeds(seed, seq). I tried it, but this function looks for exact matches, which is not what I need. Any lead on how to solve this in R will be very much appreciated.
Thanks!
> typeof(rs)
[1] "list"
> typeof(u)
[1] "list"
> head(rs)
$`miR-92|34108_3p `
[1] "TGCAAT"
$`miR-92|34106_3p `
[1] "TGCAAT"
$`miR-92|34110_3p `
[1] "TGCAAT"
$`miR-184|1952_3p `
[1] "CCGTCC"
$`miR-184|1954_3p `
[1] "CCGTCC"
$`miR-1795_3p `
[1] "CCGTCC"
> head(u)
$upper_1
[1] "gccgtt"
$upper_2
[1] "ccgagc"
$upper_3
[1] "gacatt"
$upper_4
[1] "gcttat"
$upper_5
[1] "taccta"
$upper_6
[1] "tcgtct"
Upvotes: 0
Views: 275
Reputation: 19716
If you want to find out whether any sequences in the rs list are complementary to the strings in the u list, and you want it to be performant, you can use the matchPDict function from the Biostrings package.
Example:
library(Biostrings)
library(IRanges)
lis <- list(`miR-92|34108_3p ` = "TGCAAT",
`miR-92|34106_3p ` = "TGCAAT",
`miR-92|34110_3p ` = "TGCAAT")
u <- list(upper_1 ="gccgtt",
upper_2 = "ccgagc",
upper_3 = "gacatt",
upper_4 = "gcttat",
upper_5 = "taccta")
Convert first list to DNAStringSet:
lis <- DNAStringSetList(lis)
lis <- unlist(lis)
Convert second list to DNAStringSet:
u <- DNAStringSetList(u)
u <- unlist(u)
Get the complement of lis:
lis_rc <- complement(lis)
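To make the difference between the complement and the reverse complement concrete, here is a minimal base R sketch that does not depend on Biostrings. The helper names comp, rev_str and revcomp are hypothetical and exist only for illustration; in real code the Biostrings functions complement and reverseComplement do this job.

```r
# Hypothetical helpers, base R only (not part of Biostrings).
# Complement: swap A<->T and C<->G, keeping the case of each letter.
comp <- function(x) chartr("ACGTacgt", "TGCAtgca", x)

# Reverse each string character by character.
rev_str <- function(x) {
  vapply(strsplit(x, ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Reverse complement: complement first, then reverse.
revcomp <- function(x) rev_str(comp(x))

seed <- "TGCAAT"
comp(seed)     # "ACGTTA"
revcomp(seed)  # "ATTGCA"
```

Which orientation is the right one depends on how the sequences in rs and u were written down (5' to 3' or the other way round), so it is worth checking one known pair by hand first.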
Create a PDict so you can match it quickly against the other list:
pdict0 <- PDict(lis_rc)
Iterate over u, running matchPDict on each sequence:
lapply(u, function(x) matchPDict(pdict0, x))
EDIT: If you want to check every orientation, you can create them with accessory functions such as complement, reverse and reverseComplement, and provide the combined set to PDict:
lis_rc <- c(lis,
            complement(lis),
            reverse(lis),
            reverseComplement(lis))
names(lis_rc) <- paste(trimws(names(lis_rc)),
                       rep(c("", "c", "r", "rc"), each = length(lis)),
                       sep = "_")
pdict0 <- PDict(lis_rc)
res <- lapply(u, function(x) matchPDict(pdict0, x))
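res is a list with one MIndex object per sequence in u; each MIndex holds the match ranges (as IRanges) for every pattern in pdict0. You can check where the hits are with: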
lapply(res, width)
lapply(res, start)
lapply(res, end)
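If it is unclear what start, width and end refer to, the same idea can be illustrated in base R on a single string. This is only a sketch: the subject sequence below is made up, and regexpr reports just the first match, whereas matchPDict reports all of them.

```r
# Made-up subject sequence, upper-cased so it compares equal to the pattern.
subject <- toupper("ttattgcagg")
pattern <- "ATTGCA"

# fixed = TRUE treats the pattern as a literal string, not a regex.
m <- regexpr(pattern, subject, fixed = TRUE)

hit_start <- as.integer(m)               # 1-based position of the first match
hit_width <- attr(m, "match.length")     # length of the matched substring
hit_end   <- hit_start + hit_width - 1L  # last matched position, inclusive

c(start = hit_start, width = hit_width, end = hit_end)
```

Coordinates are 1-based and the end is inclusive, which is also the convention IRanges uses.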
EDIT2: If you just want to count the matches, without the match coordinates, you can simply use:
vcountPDict(pdict0, u)
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 0 0 0 0 0
[6,] 0 0 0 0 0
[7,] 0 0 0 0 0
[8,] 0 0 0 0 0
[9,] 0 0 0 0 0
[10,] 0 0 0 0 0
[11,] 0 0 0 0 0
[12,] 0 0 0 0 0
Rows correspond to sequences in pdict0, while columns correspond to sequences in u:
mat <- vcountPDict(pdict0, u)
rownames(mat) <- names(lis_rc)
colnames(mat) <- names(u)
mat
upper_1 upper_2 upper_3 upper_4 upper_5
miR-92|34108_3p_ 0 0 0 0 0
miR-92|34106_3p_ 0 0 0 0 0
miR-92|34110_3p_ 0 0 0 0 0
miR-92|34108_3p_c 0 0 0 0 0
miR-92|34106_3p_c 0 0 0 0 0
miR-92|34110_3p_c 0 0 0 0 0
miR-92|34108_3p_r 0 0 0 0 0
miR-92|34106_3p_r 0 0 0 0 0
miR-92|34110_3p_r 0 0 0 0 0
miR-92|34108_3p_rc 0 0 0 0 0
miR-92|34106_3p_rc 0 0 0 0 0
miR-92|34110_3p_rc 0 0 0 0 0
Upvotes: 1