Reputation: 38043
I would like to store a GenomicRanges::GRanges object from Bioconductor as a single column in a base R data.frame. The reason is that I'd like to write some ggplot2 functions that work exclusively with data.frames under the hood. However, none of my attempts so far have been fruitful. Basically, this is what I want to do:
library(GenomicRanges)
x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(x = x, y = 1:2)
But the column is automatically expanded into several columns, whereas I would like to keep it as a valid GRanges object in a single column:
> df
x.seqnames x.start x.end x.width x.strand y
1 chr1 100 200 101 * 1
2 chr1 200 300 101 * 2
With S4Vectors::DataFrame it works as I want, but I'd like a base R data.frame to do the same thing:
> S4Vectors::DataFrame(x = x, y = 1:2)
DataFrame with 2 rows and 2 columns
x y
<GRanges> <integer>
1 chr1:100-200 1
2 chr1:200-300 2
I also tried the following, without success:
> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
y x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2 <NA>
Warning message: In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, : corrupt data frame: columns will be truncated or padded with NAs
> df[["x"]] <- I(x)
Error in rep(value, length.out = nrows) : attempt to replicate an object of type 'S4'
I had some minor success implementing an S3 variant of the GRanges class using vctrs::new_rcrd, but that seems a very roundabout way to get a single column representing a genomic range.
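For reference, here is a minimal sketch of what such an S3 variant might look like. The class name toy_range and the constructor new_range are hypothetical names made up for this illustration, and only seqnames, start and end are stored, so this is far from a full GRanges replacement (no strand, seqinfo or metadata columns).

```r
library(vctrs)

# Hypothetical constructor: a record vector with one field per coordinate.
new_range <- function(seqnames = character(), start = integer(), end = integer()) {
  new_rcrd(
    list(seqnames = seqnames, start = start, end = end),
    class = "toy_range"
  )
}

# format() method so that each element is rendered as "chr:start-end".
format.toy_range <- function(x, ...) {
  paste0(field(x, "seqnames"), ":", field(x, "start"), "-", field(x, "end"))
}

r <- new_range(c("chr1", "chr1"), c(100L, 200L), c(200L, 300L))

# The record vector is assigned as one column of a base R data.frame.
df <- data.frame(y = 1:2)
df$x <- r
```

Because the record reports its length as the number of ranges rather than the number of fields, it can sit in a data.frame as a single column, which is what the GRanges object itself fails to do.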
Upvotes: 2
Reputation: 757
I found a very simple way to convert a GRanges object to a data.frame so that you can operate on it easily: the annoGR2DF function in the Repitools package.
> library(GenomicRanges)
> library(Repitools)
>
> x <- GRanges(c("chr1:100-200", "chr1:200-300"))
>
> df <- annoGR2DF(x)
> df
chr start end width
1 chr1 100 200 101
2 chr1 200 300 101
> class(df)
[1] "data.frame"
Upvotes: 4
Reputation: 38043
Since posting this question, I figured out that the crux of my problem seems to be that the format method for S4 objects does not play nicely with data.frames; having a GRanges object as a column isn't necessarily a problem in itself. (Constructing the data.frame with data.frame() still is, though.)
Consider this bit of the original question:
> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
y x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2
Warning message: In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, : corrupt data frame: columns will be truncated or padded with NAs
If we write a simple format method for GRanges, printing the data.frame no longer triggers the warning:
library(GenomicRanges)
format.GRanges <- function(x, ...) {showAsCell(x)}
df <- data.frame(y = 1:3)
df$x <- GRanges(c("chr1:100-200", "chr1:200-300", "chr2:100-200"))
> df
y x
1 1 chr1:100-200
2 2 chr1:200-300
3 3 chr2:100-200
It seems to subset just fine too:
> df[c(1,3),]
y x
1 1 chr1:100-200
3 3 chr2:100-200
As a bonus, this seems to work for other S4 classes too, for example:
library(S4Vectors)
format.Rle <- function(x, ...) {showAsCell(x)}
x <- Rle(1:5, 5:1)
df <- data.frame(y = 1:15)
df$x <- x
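To check that the column is still a real GRanges object rather than just a character rendering of one, something like the following could be used. It is only a sketch that repeats the setup from this answer; the variable name sub is made up for the illustration.

```r
library(GenomicRanges)

# Same format method as above, so that printing works.
format.GRanges <- function(x, ...) {showAsCell(x)}

df <- data.frame(y = 1:3)
df$x <- GRanges(c("chr1:100-200", "chr1:200-300", "chr2:100-200"))

# Subset rows, then inspect the column with the usual GRanges accessors.
sub <- df[c(1, 3), ]
is(sub$x, "GRanges")
as.character(seqnames(sub$x))
start(sub$x)
```

If the accessors return the expected values for the selected rows, the format method only changes how the column is displayed, not what is stored.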
Upvotes: 0
Reputation: 138
A not-so-pretty but practical solution is to use the accessor functions of GenomicRanges and convert each result to a plain vector, i.e. numeric or character. I used magrittr pipes here, but you can also do it without them.
library(GenomicRanges)
library(magrittr)
x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(y = 1:2)
df$chr <- seqnames(x) %>% as.character
df$start <- start(x) %>% as.numeric
df$end <- end(x) %>% as.numeric
df$strand <- strand(x) %>% as.character
df$width <- width(x) %>% as.numeric
df
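As a shorter alternative to the accessor-based conversion above, the as.data.frame method for GRanges does the same flattening in one call, and makeGRangesFromDataFrame from GenomicRanges can rebuild the ranges afterwards. This is a sketch of that round trip; the variable names flat and gr are made up for the illustration.

```r
library(GenomicRanges)

x <- GRanges(c("chr1:100-200", "chr1:200-300"))

# Flatten to plain columns: seqnames, start, end, width, strand.
flat <- as.data.frame(x)
flat$y <- 1:2

# ... work with `flat` as an ordinary data.frame ...

# Rebuild a GRanges object; extra columns such as y become metadata columns.
gr <- makeGRangesFromDataFrame(flat, keep.extra.columns = TRUE)
```

This does not keep the ranges in a single column, but it avoids writing out each accessor by hand and gives a way back to a GRanges object when one is needed.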
Upvotes: 0