Reputation: 47
I have a gff
file:
seqnames type ranges Name
Cch08 exon 57512691-57513280 ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;
Cch08 mRNA 57512691-57514658 ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;
Cch08 CDS 57512702-57513280 ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;
Another file (File
):
SNP Name beta t.stat p.value FDR
Cch08_57580286 C08g068630.3.2 7.92 77.7 1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5 -12.9 -41.1 2.25e-145 1.96e-140
I tried merging both the files as I want to add type
column from the gff
file to the other file based on the column Name
. I used:
merge(File,GFF,by="Name")
I got:
SNP Name beta t.stat p.value FDR type
Cch08_57580286 C08g068630.3.2 7.92 77.7 1.51e-242 3.71e-237 mRNA
Cch08_58328786 C08g068610.3.2 -12.9 -41.1 2.25e-145 1.96e-140 mRNA
It is just showing mRNA
for all the rows. I want it to give the type
based on closest ranges
column in GFF
. I am not sure how can I add this criteria in the merge
command. Thank you!
Upvotes: 0
Views: 140
Reputation: 8506
I would convert file
into a GRanges
object and then use distanceToNearest
to get the nearest element. Example loosely based on your data:
suppressPackageStartupMessages(library(rtracklayer))
# generate gff file
write.table(
data.frame(
seqnames = "Cch08",
source = "own",
feature = c("exon", "mRNA", "CDS"),
start = c(57512691, 57512691, 57512702),
end = c(57513280, 57514658, 57513280),
score = 0,
strand = "*",
phase = ".",
attributes = c(
"ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;",
paste0("ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;",
"Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;"),
"ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;"
)
),
"my.gff",
sep = '\t',
row.names = FALSE,
quote = FALSE,
col.names = FALSE
)
GFF <- import("my.gff", "GFF")
File <-
read.table(
text = "SNP Name beta t.stat p.value FDR
Cch08_57580286 C08g068630.3.2 7.92 77.7 1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5 -12.9 -41.1 2.25e-145 1.96e-140",
header = TRUE
)
File <-
cbind(setNames(data.frame(do.call(
rbind, strsplit(File$SNP, "_")
)), c("seqnames", "start")), File)
File <-
GenomicRanges::makeGRangesFromDataFrame(File, keep.extra.columns = TRUE, end.field = "start")
dtn <- distanceToNearest(File, GFF)
File$type <- GFF[subjectHits(dtn)]$type
# optional: add ID of and distance to nearest element from GFF
File$nearest <- GFF[subjectHits(dtn)]$ID
File$dist <- values(dtn)$distance
File
#> GRanges object with 2 ranges and 9 metadata columns:
#> seqnames ranges strand | SNP Name beta
#> <Rle> <IRanges> <Rle> | <character> <character> <numeric>
#> [1] Cch08 57580286 * | Cch08_57580286 C08g068630.3.2 7.92
#> [2] Cch08 58328786 * | Cch08_58328786 C08g068610.3.5 -12.90
#> t.stat p.value FDR type nearest dist
#> <numeric> <numeric> <numeric> <factor> <character> <integer>
#> [1] 77.7 1.51e-242 3.71e-237 mRNA C08g068630.3.2 65627
#> [2] -41.1 2.25e-145 1.96e-140 mRNA C08g068630.3.2 814127
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Created on 2021-05-26 by the reprex package (v2.0.0)
Upvotes: 1
Reputation: 16978
I'm not 100% sure what you are trying to do, so I propose a solution for the problem I believe you want to solve.
First we separate the ranges in gff:
library(dplyr)
library(tidyr)
gff_new <- gff %>%
separate(ranges, c("min", "max"), convert=TRUE)
Now we join the two data.frames
file %>%
separate(SNP, c("seqnames", "value"), convert=TRUE) %>%
left_join(gff_new, by=c("seqnames")) %>%
group_by(seqnames, Name.x, type) %>%
mutate(dist = min(abs(value - min), abs(value - max))) %>%
group_by(seqnames, Name.x) %>%
filter(dist==min(dist)) %>%
mutate(SNP = paste0(seqnames, "_", value)) %>%
ungroup() %>%
select(SNP, Name=Name.x, beta, t.stat, p.value, FDR, type)
which yields
# A tibble: 2 x 7
SNP Name beta t.stat p.value FDR type
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Cch08_57580286 C08g068630.3.2 7.92 77.7 1.51e-242 3.71e-237 mRNA
2 Cch08_58328786 C08g068610.3.5 -12.9 -41.1 2.25e-145 1.96e-140 mRNA
gff <- readr::read_table2("seqnames type ranges Name
Cch08 exon 57512691-57513280 ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;
Cch08 mRNA 57512691-57514658 ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;
Cch08 CDS 57512702-57513280 ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;")
file <- readr::read_table2("SNP Name beta t.stat p.value FDR
Cch08_57580286 C08g068630.3.2 7.92 77.7 1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5 -12.9 -41.1 2.25e-145 1.96e-140")
Upvotes: 1