user1567654
user1567654

Reputation: 47

merging two files based on two columns in r

I have a gff file:

seqnames type ranges Name
Cch08   exon    57512691-57513280   ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;
Cch08   mRNA    57512691-57514658   ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;
Cch08   CDS     57512702-57513280   ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;

Another file (File):

SNP                 Name        beta   t.stat   p.value       FDR
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5   -12.9   -41.1    2.25e-145 1.96e-140

I tried merging both the files as I want to add type column from the gff file to the other file based on the column Name. I used:

merge(File,GFF,by="Name")

I got:

SNP                 Name        beta   t.stat   p.value       FDR       type
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237     mRNA
Cch08_58328786 C08g068610.3.2   -12.9   -41.1    2.25e-145 1.96e-140     mRNA

It is just showing mRNA for all the rows. I want it to give the type based on closest ranges column in GFF. I am not sure how can I add this criteria in the merge command. Thank you!

Upvotes: 0

Views: 140

Answers (2)

user12728748
user12728748

Reputation: 8506

I would convert file into a GRanges object and then use distanceToNearest to get the nearest element. Example loosely based on your data:

suppressPackageStartupMessages(library(rtracklayer))
# generate gff file
write.table(
    data.frame(
        seqnames = "Cch08",
        source = "own",
        feature = c("exon", "mRNA", "CDS"),
        start = c(57512691, 57512691, 57512702),
        end = c(57513280, 57514658, 57513280),
        score = 0,
        strand = "*",
        phase = ".",
        attributes = c(
            "ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;",
            paste0("ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;",
                   "Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;"),
            "ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;"
        )
    ),
    "my.gff",
    sep = '\t',
    row.names = FALSE,
    quote = FALSE,
    col.names = FALSE
)
GFF <- import("my.gff", "GFF")

File <-
    read.table(
        text = "SNP                 Name        beta   t.stat   p.value       FDR
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5   -12.9   -41.1    2.25e-145 1.96e-140",
        header = TRUE
    )

File <-
    cbind(setNames(data.frame(do.call(
        rbind, strsplit(File$SNP, "_")
    )), c("seqnames", "start")), File)
File <-
    GenomicRanges::makeGRangesFromDataFrame(File, keep.extra.columns = TRUE, end.field = "start")

dtn <- distanceToNearest(File, GFF)
File$type <- GFF[subjectHits(dtn)]$type

# optional: add ID of and distance to nearest element from GFF
File$nearest <- GFF[subjectHits(dtn)]$ID
File$dist <- values(dtn)$distance
File
#> GRanges object with 2 ranges and 9 metadata columns:
#>       seqnames    ranges strand |            SNP           Name      beta
#>          <Rle> <IRanges>  <Rle> |    <character>    <character> <numeric>
#>   [1]    Cch08  57580286      * | Cch08_57580286 C08g068630.3.2      7.92
#>   [2]    Cch08  58328786      * | Cch08_58328786 C08g068610.3.5    -12.90
#>          t.stat   p.value       FDR     type        nearest      dist
#>       <numeric> <numeric> <numeric> <factor>    <character> <integer>
#>   [1]      77.7 1.51e-242 3.71e-237     mRNA C08g068630.3.2     65627
#>   [2]     -41.1 2.25e-145 1.96e-140     mRNA C08g068630.3.2    814127
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2021-05-26 by the reprex package (v2.0.0)

Upvotes: 1

Martin Gal
Martin Gal

Reputation: 16978

I'm not 100% sure what you are trying to do, so I propose a solution for the problem I believe you want to solve.

First we separate the ranges in gff:

library(dplyr)
library(tidyr)

gff_new <- gff %>%
  separate(ranges, c("min", "max"), convert=TRUE)

Now we join the two data.frames

file %>%
  separate(SNP, c("seqnames", "value"), convert=TRUE) %>%
  left_join(gff_new, by=c("seqnames")) %>%
  group_by(seqnames, Name.x, type) %>%
  mutate(dist = min(abs(value - min), abs(value - max))) %>%
  group_by(seqnames, Name.x) %>%
  filter(dist==min(dist)) %>%
  mutate(SNP = paste0(seqnames, "_", value)) %>%
  ungroup() %>%
  select(SNP, Name=Name.x, beta, t.stat, p.value, FDR, type)

which yields

# A tibble: 2 x 7
  SNP            Name             beta t.stat   p.value       FDR type 
  <chr>          <chr>           <dbl>  <dbl>     <dbl>     <dbl> <chr>
1 Cch08_57580286 C08g068630.3.2   7.92   77.7 1.51e-242 3.71e-237 mRNA 
2 Cch08_58328786 C08g068610.3.5 -12.9   -41.1 2.25e-145 1.96e-140 mRNA 

Data

gff <- readr::read_table2("seqnames type ranges Name
Cch08   exon    57512691-57513280   ID=C08g068630.3.2:exon:2;Parent=C08g068630.3.2 ;
Cch08   mRNA    57512691-57514658   ID=C08g068630.3.2;Parent=C08g068630.3.2 ;Name=C08g068630.3.2;Note=Similar to C08g068630.3.2: ribosomal protein;ref_id=C08g068630.3.2;
Cch08   CDS     57512702-57513280   ID=C08g068630.3.2:cds;Parent=C08g068630.3.2 ;")

file <- readr::read_table2("SNP                 Name        beta   t.stat   p.value       FDR
Cch08_57580286 C08g068630.3.2    7.92   77.7     1.51e-242 3.71e-237
Cch08_58328786 C08g068610.3.5   -12.9   -41.1    2.25e-145 1.96e-140")

Upvotes: 1

Related Questions