merging two files based on two columns in r

Question

I have a gff file:

seqnames type ranges Name
Cch08   exon    57512691-57513280   ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;
Cch08   mRNA    57512691-57514658   ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;
Cch08   CDS     57512702-57513280   ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;

Another file (File):

SNP                 Name        beta   t.stat   p.value       FDR
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5   -12.9   -41.1    2.25e-145 1.96e-140

I tried merging both the files as I want to add type column from the gff file to the other file based on the column Name. I used:

merge(File,GFF,by="Name")

I got:

SNP                 Name        beta   t.stat   p.value       FDR       type
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237     mRNA
Cch08_58328786 C08g068610.3.2   -12.9   -41.1    2.25e-145 1.96e-140     mRNA

It is just showing mRNA for all the rows. I want it to give the type based on closest ranges column in GFF. I am not sure how can I add this criteria in the merge command. Thank you!

user12728748 · Accepted Answer

I would convert file into a GRanges object and then use distanceToNearest to get the nearest element. Example loosely based on your data:

suppressPackageStartupMessages(library(rtracklayer))
# generate gff file
write.table(
    data.frame(
        seqnames = "Cch08",
        source = "own",
        feature = c("exon", "mRNA", "CDS"),
        start = c(57512691, 57512691, 57512702),
        end = c(57513280, 57514658, 57513280),
        score = 0,
        strand = "*",
        phase = ".",
        attributes = c(
            "ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;",
            paste0("ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;",
                   "Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;"),
            "ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;"
        )
    ),
    "my.gff",
    sep = '	',
    row.names = FALSE,
    quote = FALSE,
    col.names = FALSE
)
GFF <- import("my.gff", "GFF")

File <-
    read.table(
        text = "SNP                 Name        beta   t.stat   p.value       FDR
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5   -12.9   -41.1    2.25e-145 1.96e-140",
        header = TRUE
    )

File <-
    cbind(setNames(data.frame(do.call(
        rbind, strsplit(File$SNP, "_")
    )), c("seqnames", "start")), File)
File <-
    GenomicRanges::makeGRangesFromDataFrame(File, keep.extra.columns = TRUE, end.field = "start")

dtn <- distanceToNearest(File, GFF)
File$type <- GFF[subjectHits(dtn)]$type

# optional: add ID of and distance to nearest element from GFF
File$nearest <- GFF[subjectHits(dtn)]$ID
File$dist <- values(dtn)$distance
File
#> GRanges object with 2 ranges and 9 metadata columns:
#>       seqnames    ranges strand |            SNP           Name      beta
#>              |         
#>   [1]    Cch08  57580286      * | Cch08_57580286 C08g068630.3.2      7.92
#>   [2]    Cch08  58328786      * | Cch08_58328786 C08g068610.3.5    -12.90
#>          t.stat   p.value       FDR     type        nearest      dist
#>               
#>   [1]      77.7 1.51e-242 3.71e-237     mRNA C08g068630.3.2     65627
#>   [2]     -41.1 2.25e-145 1.96e-140     mRNA C08g068630.3.2    814127
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

^{Created on 2021-05-26 by the reprex package (v2.0.0)}

merging two files based on two columns in r

Answers (2)

Data

Related Questions