Manci

Reputation: 5

If a number in one data frame falls within a range defined by another data frame, print the information from both data frames

I have two large data frames with head() below:

Data frame one:

  family_name st_pos
  <chr>        <dbl>
1 AluSp        26791
2 AluJo        31436
3 AluSx        39624
4 AluSz6       40738
5 AluYe5       51585
6 AluSc        62160

Data frame two:

  external_gene_name start_position end_position
1             ATP1A2      160115759    160143591
2               GCLM       93885199     93909456
3                TPR      186311652    186375693
4             VPS13D       12230030     12512047
5              SZRD1       16352575     16398145
6             ATP2B4      203626561    203744081

Whenever a value of st_pos in data frame one is larger than start_position and smaller than end_position in a row of data frame two, I'd like that pair to appear in a new table with the columns shown below.

external_gene_name    family_name     st_pos

I'm really new to R and I don't even know where to start with this. Thank you so much for helping to "exponentiate" my learning curve.

Upvotes: 0

Views: 43

Answers (1)

Ian Campbell

Reputation: 24888

The Bioconductor package GenomicRanges is designed for exactly this kind of problem.

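As you may know, none of the Alus in your sample overlap the genes you provided, so I made some positions up (see Sample Data at the end). Note that the code places every range on chr1, because the sample data has no chromosome column.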

library(GenomicRanges)
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1),
                names = df1$family_name)
Alus
#GRanges object with 6 ranges and 1 metadata column:
#      seqnames    ranges strand |    names
#         <Rle> <IRanges>  <Rle> | <factor>
#  [1]     chr1 160115859      * |    AluSp
#  [2]     chr1  93885299      * |    AluJo
#  [3]     chr1 186312452      * |    AluSx
#  [4]     chr1  12230230      * |   AluSz6
#  [5]     chr1 203627561      * |   AluYe5
#  [6]     chr1     62160      * |    AluSc

Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position),
                 names = df2$external_gene_name)
Genes
#GRanges object with 6 ranges and 1 metadata column:
#      seqnames              ranges strand |    names
#         <Rle>           <IRanges>  <Rle> | <factor>
#  [1]     chr1 160115759-160143591      * |   ATP1A2
#  [2]     chr1   93885199-93909456      * |     GCLM
#  [3]     chr1 186311652-186375693      * |      TPR
#  [4]     chr1   12230030-12512047      * |   VPS13D
#  [5]     chr1   16352575-16398145      * |    SZRD1
#  [6]     chr1 203626561-203744081      * |   ATP2B4

Then you can use findOverlaps to find the overlaps between the two sets of ranges:

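# Genes is the query, Alus is the subject
Overlaps <- findOverlaps(Genes, Alus)
# One row per (gene, Alu) hit, with the gene columns first
data.frame(Genes[queryHits(Overlaps), ], Alus[subjectHits(Overlaps), ])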
#  seqnames     start       end  width strand  names seqnames.1   start.1     end.1 width.1 strand.1 names.1
#1     chr1 160115759 160143591  27833      * ATP1A2       chr1 160115859 160115859       1        *   AluSp
#2     chr1  93885199  93909456  24258      *   GCLM       chr1  93885299  93885299       1        *   AluJo
#3     chr1 186311652 186375693  64042      *    TPR       chr1 186312452 186312452       1        *   AluSx
#4     chr1  12230030  12512047 282018      * VPS13D       chr1  12230230  12230230       1        *  AluSz6
#5     chr1 203626561 203744081 117521      * ATP2B4       chr1 203627561 203627561       1        *  AluYe5

If a gene overlapped several Alus, it would appear in several rows, one per overlap.
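
The question asked for only three columns (external_gene_name, family_name, st_pos) and for positions strictly between start and end. Below is a minimal sketch of one way to get there from the overlap result. It assumes the metadata columns are named family_name and external_gene_name rather than names (a naming choice made here for illustration), and it rebuilds df1 and df2 as plain data frames with the same values as the sample data. Because findOverlaps() treats ranges as closed intervals, a position equal to a gene's start or end counts as a hit, so the sketch filters those out afterwards.

```r
library(GenomicRanges)

# Same values as the sample data, rebuilt here so the sketch is self-contained
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859, 93885299, 186312452, 12230230, 203627561, 62160),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759, 93885199, 186311652, 12230030, 16352575, 203626561),
  end_position = c(160143591, 93909456, 186375693, 12512047, 16398145, 203744081),
  stringsAsFactors = FALSE
)

# Assumption: everything is on chr1, as in the code above
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1),
                family_name = df1$family_name)
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position),
                 external_gene_name = df2$external_gene_name)

hits <- findOverlaps(Genes, Alus)
g <- Genes[queryHits(hits)]    # gene side of each hit
a <- Alus[subjectHits(hits)]   # Alu side of each hit

result <- data.frame(
  external_gene_name = g$external_gene_name,
  family_name = a$family_name,
  st_pos = start(a)
)

# Drop hits that sit exactly on a gene boundary (strict inequality)
strict <- start(a) > start(g) & start(a) < end(g)
result <- result[strict, ]
result
```

With real data you would presumably pass the actual chromosome of each Alu and gene as seqnames, so that positions are only compared within the same chromosome.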

Sample Data

df1 <- structure(list(family_name = structure(c(3L, 1L, 4L, 5L, 6L, 
2L), .Label = c("AluJo", "AluSc", "AluSp", "AluSx", "AluSz6", 
"AluYe5"), class = "factor"), st_pos = c(160115859L, 93885299L, 
186312452L, 12230230L, 203627561L, 62160L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

df2 <- structure(list(external_gene_name = structure(c(1L, 3L, 5L, 6L, 
4L, 2L), .Label = c("ATP1A2", "ATP2B4", "GCLM", "SZRD1", "TPR", 
"VPS13D"), class = "factor"), start_position = c(160115759L, 
93885199L, 186311652L, 12230030L, 16352575L, 203626561L), end_position = c(160143591L, 
93909456L, 186375693L, 12512047L, 16398145L, 203744081L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

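As a cross-check that does not need Bioconductor, the same condition can be written in base R. This is a different technique from the one above (a plain cross join followed by a filter, not an interval overlap search), shown only as a sketch: it builds every gene/Alu pair, so it may be impractical for large data frames, and it ignores chromosomes entirely. The data frames repeat the sample values as plain character and numeric columns.

```r
# Same values as the sample data, without factors
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859, 93885299, 186312452, 12230230, 203627561, 62160),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759, 93885199, 186311652, 12230030, 16352575, 203626561),
  end_position = c(160143591, 93909456, 186375693, 12512047, 16398145, 203744081),
  stringsAsFactors = FALSE
)

# Cross join: by = NULL pairs every gene with every Alu
pairs <- merge(df2, df1, by = NULL)

# Keep pairs where st_pos lies strictly inside the gene interval
inside <- pairs$st_pos > pairs$start_position & pairs$st_pos < pairs$end_position
result <- pairs[inside, c("external_gene_name", "family_name", "st_pos")]
rownames(result) <- NULL
result
```

On the sample data this keeps five pairs; SZRD1 and AluSc have no partner and drop out.
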
Upvotes: 2
