Reputation: 3
I have a large dataframe with scaffold annotations (example rows):
gff <- data.frame(seqid = c("Scaffold21", "Scaffold21", "Scaffold21", "Scaffold31", "Scaffold31", "Scaffold11561", "Scaffold11561"),
start = c(4179,16947,18411,25986,45575, 52,54100),
end = c(4697,17667,19643,32223,46657,1572,54627),
attributes = c("tRNA","sRNA","exon","rRNA","mRNA","mRNA","exon"))
And I have another dataframe with RNA coordinates (example rows):
RNA <- data.frame(seqid = c("Scaffold21", "Scaffold11561"),
start = c(17047,1380))
I've been trying to filter the first dataframe to annotate the RNAs in the second one using:
scaffold <- unique(RNA$seqid)
coord <- RNA$start
n <- length(scaffold)*length(coord)
output <- matrix(ncol = ncol(gff), nrow = n)
myfunc <- function(x,y){gff[gff$seqid == x & gff$start <= y & gff$end >= y,]}
for (x in scaffold) {
for (y in coord) {
test = myfunc(x, y)
output <- test
}
}
The problem here is that only the information about the last x,y pair is being stored, because output is overwritten on every iteration. I'd really appreciate it if someone could help me fix this.
The output that I'm getting now looks like:
seqid | start | end | attributes |
---|---|---|---|
Scaffold11561 | 52 | 1572 | mRNA |
Ideally, it would look like:
seqid | start | end |
---|---|---|
Scaffold21 | 16947 | 17667 |
Scaffold11561 | 52 | 1572 |
Upvotes: 0
Views: 196
Reputation: 150
Given your sample code, you could use something like:
scaffold <- unique(RNA$seqid)
coord <- RNA$start
n <- length(scaffold)*length(coord)
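output <- data.frame(matrix(ncol = ncol(gff), nrow = n)) # a matrix can only store one type, so wrap it in a data frame
names(output) <- names(gff) # keep the original column names instead of X1, X2, ...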
myfunc <- function(x,y){gff[gff$seqid == x & gff$start <= y & gff$end >= y,]}
i <- 0L
for (x in scaffold) {
for (y in coord) {
i <- i + 1L
test <- myfunc(x, y)
if (nrow(test) != 1) next # skip pairs with no match or with more than one match
output[i, ] <- test
}
}
output <- na.omit(output)
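Note that the nested loop tests every scaffold against every coordinate, not just the scaffold/coordinate pairs that actually appear together in RNA. If only those pairs should be checked, a minimal sketch is to iterate over the rows of RNA and collect the matches in a list. It assumes the sample data from the question; the names `hits` and `output_paired` are made up for illustration:

```r
# Sample data from the question
gff <- data.frame(
  seqid = c("Scaffold21", "Scaffold21", "Scaffold21", "Scaffold31",
            "Scaffold31", "Scaffold11561", "Scaffold11561"),
  start = c(4179, 16947, 18411, 25986, 45575, 52, 54100),
  end = c(4697, 17667, 19643, 32223, 46657, 1572, 54627),
  attributes = c("tRNA", "sRNA", "exon", "rRNA", "mRNA", "mRNA", "exon"),
  stringsAsFactors = FALSE
)
RNA <- data.frame(
  seqid = c("Scaffold21", "Scaffold11561"),
  start = c(17047, 1380),
  stringsAsFactors = FALSE
)

# For each row of RNA, keep the gff rows on the same scaffold whose
# [start, end] interval contains the RNA start coordinate.
# `hits` is an illustrative name, not from the question.
hits <- lapply(seq_len(nrow(RNA)), function(i) {
  gff[gff$seqid == RNA$seqid[i] &
        gff$start <= RNA$start[i] &
        gff$end >= RNA$start[i], ]
})

# Stack the per-row results into a single data frame
output_paired <- do.call(rbind, hits)
```

Unlike the version with `next`, this keeps every matching annotation when a coordinate falls inside more than one interval, and it does not need a preallocated output or `na.omit()`.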
The loop approach is probably slow if you have a lot of rows. You could also use a join instead. For example:
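# Both data frames have a start column, so merge() renames them to start.x (gff) and start.y (RNA)
a <- merge(gff, RNA, by = "seqid")
# Keep the annotations whose interval contains the RNA coordinate
a[(a$start.x <= a$start.y) & (a$end >= a$start.y), ]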
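As a self-contained sketch of the join approach on the sample data from the question: the column name `rna_start` and the variable `res` are illustrative choices, not anything from the original post.

```r
# Sample data from the question
gff <- data.frame(
  seqid = c("Scaffold21", "Scaffold21", "Scaffold21", "Scaffold31",
            "Scaffold31", "Scaffold11561", "Scaffold11561"),
  start = c(4179, 16947, 18411, 25986, 45575, 52, 54100),
  end = c(4697, 17667, 19643, 32223, 46657, 1572, 54627),
  attributes = c("tRNA", "sRNA", "exon", "rRNA", "mRNA", "mRNA", "exon"),
  stringsAsFactors = FALSE
)
RNA <- data.frame(
  seqid = c("Scaffold21", "Scaffold11561"),
  start = c(17047, 1380),
  stringsAsFactors = FALSE
)

# Inner join on scaffold name; the clashing `start` columns become
# start.x (from gff) and start.y (from RNA)
a <- merge(gff, RNA, by = "seqid")

# Keep annotations whose interval contains the RNA coordinate
res <- a[a$start.x <= a$start.y & a$end >= a$start.y, ]

# Rename the suffixed columns for readability (illustrative names)
names(res)[names(res) == "start.x"] <- "start"
names(res)[names(res) == "start.y"] <- "rna_start"
```

On the sample data this leaves two rows: the sRNA on Scaffold21 and the mRNA on Scaffold11561. By default `merge()` sorts the result by the join key, so the row order can differ from the loop version. The join also creates one intermediate row per annotation/coordinate combination on each scaffold, so it may use a lot of memory on very large inputs.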
Upvotes: 1