Reputation: 5
I have two large data frames; the head() of each is shown below:
Data frame one:
family_name st_pos
<chr> <dbl>
1 AluSp 26791
2 AluJo 31436
3 AluSx 39624
4 AluSz6 40738
5 AluYe5 51585
6 AluSc 62160
Data frame two:
external_gene_name start_position end_position
1 ATP1A2 160115759 160143591
2 GCLM 93885199 93909456
3 TPR 186311652 186375693
4 VPS13D 12230030 12512047
5 SZRD1 16352575 16398145
6 ATP2B4 203626561 203744081
For every row where st_pos from data frame one is larger than start_position and smaller than end_position in data frame two, I'd like to produce a new table with the columns indicated below.
external_gene_name family_name st_pos
I'm really new to R and I don't even know where to start with this. Thank you so much for helping to "exponentiate" my learning curve.
Upvotes: 0
Views: 43
Reputation: 24888
The GenomicRanges package is specifically designed for this kind of problem.
Note that none of the sample Alus you posted overlap the genes you provided, so I made up some positions that do (see Sample Data below).
library(GenomicRanges)
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1),
                names = df1$family_name)
Alus
#GRanges object with 6 ranges and 1 metadata column:
# seqnames ranges strand | names
# <Rle> <IRanges> <Rle> | <factor>
# [1] chr1 160115859 * | AluSp
# [2] chr1 93885299 * | AluJo
# [3] chr1 186312452 * | AluSx
# [4] chr1 12230230 * | AluSz6
# [5] chr1 203627561 * | AluYe5
# [6] chr1 62160 * | AluSc
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position),
                 names = df2$external_gene_name)
Genes
#GRanges object with 6 ranges and 1 metadata column:
# seqnames ranges strand | names
# <Rle> <IRanges> <Rle> | <factor>
# [1] chr1 160115759-160143591 * | ATP1A2
# [2] chr1 93885199-93909456 * | GCLM
# [3] chr1 186311652-186375693 * | TPR
# [4] chr1 12230030-12512047 * | VPS13D
# [5] chr1 16352575-16398145 * | SZRD1
# [6] chr1 203626561-203744081 * | ATP2B4
Then you can use findOverlaps to find overlaps between the two sets of ranges:
Overlaps <- findOverlaps(Genes, Alus)
data.frame(Genes[queryHits(Overlaps), ], Alus[subjectHits(Overlaps), ])
# seqnames start end width strand names seqnames.1 start.1 end.1 width.1 strand.1 names.1
#1 chr1 160115759 160143591 27833 * ATP1A2 chr1 160115859 160115859 1 * AluSp
#2 chr1 93885199 93909456 24258 * GCLM chr1 93885299 93885299 1 * AluJo
#3 chr1 186311652 186375693 64042 * TPR chr1 186312452 186312452 1 * AluSx
#4 chr1 12230030 12512047 282018 * VPS13D chr1 12230230 12230230 1 * AluSz6
#5 chr1 203626561 203744081 117521 * ATP2B4 chr1 203627561 203627561 1 * AluYe5
If a gene overlapped more than one Alu, it would appear in one row per overlap.
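To get exactly the three columns asked for in the question, the hit indices can be used to subset the original data frames directly. This is a minimal sketch, assuming everything sits on one chromosome (the "chr1" label is a placeholder, as above) and using the made-up positions from the Sample Data; `result` is just an illustrative name.

```r
library(GenomicRanges)

# Made-up inputs mirroring the Sample Data in this answer
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859L, 93885299L, 186312452L, 12230230L, 203627561L, 62160L)
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759L, 93885199L, 186311652L, 12230030L, 16352575L, 203626561L),
  end_position = c(160143591L, 93909456L, 186375693L, 12512047L, 16398145L, 203744081L)
)

# Assumption: a single placeholder chromosome for every range
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1))
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position))

Overlaps <- findOverlaps(Genes, Alus)

# queryHits() indexes rows of df2 (Genes), subjectHits() indexes rows of df1 (Alus)
result <- data.frame(
  external_gene_name = df2$external_gene_name[queryHits(Overlaps)],
  family_name = df1$family_name[subjectHits(Overlaps)],
  st_pos = df1$st_pos[subjectHits(Overlaps)]
)
result
```

Keep in mind that findOverlaps treats range endpoints as inclusive, so an Alu sitting exactly on start_position or end_position would also be reported.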
Sample Data
df1 <- structure(list(family_name = structure(c(3L, 1L, 4L, 5L, 6L,
2L), .Label = c("AluJo", "AluSc", "AluSp", "AluSx", "AluSz6",
"AluYe5"), class = "factor"), st_pos = c(160115859L, 93885299L,
186312452L, 12230230L, 203627561L, 62160L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(external_gene_name = structure(c(1L, 3L, 5L, 6L,
4L, 2L), .Label = c("ATP1A2", "ATP2B4", "GCLM", "SZRD1", "TPR",
"VPS13D"), class = "factor"), start_position = c(160115759L,
93885199L, 186311652L, 12230030L, 16352575L, 203626561L), end_position = c(160143591L,
93909456L, 186375693L, 12512047L, 16398145L, 203744081L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
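As a cross-check that does not need Bioconductor, the condition from the question (st_pos strictly larger than start_position and strictly smaller than end_position) can be applied in base R with a cross join. This is only a sketch on the made-up sample values: the intermediate table has nrow(df1) * nrow(df2) rows, so it is unlikely to scale to really large data frames, and the names `pairs`, `inside` and `result` are illustrative.

```r
# Made-up inputs mirroring the Sample Data above (plain character columns)
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859, 93885299, 186312452, 12230230, 203627561, 62160),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759, 93885199, 186311652, 12230030, 16352575, 203626561),
  end_position = c(160143591, 93909456, 186375693, 12512047, 16398145, 203744081),
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return the Cartesian product: every Alu with every gene
pairs <- merge(df1, df2, by = NULL)

# Keep only pairs where the Alu position lies strictly inside the gene
inside <- pairs$st_pos > pairs$start_position & pairs$st_pos < pairs$end_position
result <- pairs[inside, c("external_gene_name", "family_name", "st_pos")]
rownames(result) <- NULL
result
```

With these values SZRD1 and AluSc have no partner, so neither appears in `result`.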
Upvotes: 2