Reputation: 656
I'm running into an issue trying to join multiple data.frames using the plyr
package in R.
Perhaps this is not the best tool/language for joining tables from large datasets? If anyone can suggest alternatives it would be greatly appreciated (e.g. alternative R packages, Unix tools, or maybe MapReduce/Hadoop or Python [note: I have no Python or MapReduce/Hadoop experience]).
I'm running a 64-bit version of R on a Linux cluster with up to 200 GB of available memory.
I have 22 samples/data.frames, each with two columns ("ensembl_gene_name" and "FPKM" values) and ~60,000 rows, that I need to merge together. The desired data.frame will have one column of gene names and then 22 columns of FPKM values.
In some samples genes are missing, so I cannot simply sort the data frames and join them line by line. My idea was therefore to use the plyr package's join() function in an iterative fashion; however, I'm now questioning whether this is the ideal way to solve the problem, given the memory required.
I've provided 3 data.frames and the code I've tried: df1, df2, and df3 (I don't have enough reputation to post more than two links, so I'll put them in a comment below).
library(plyr)
df1 <- read.csv("HAP1_dBet6_1h_1.csv")
df2 <- read.csv("HAP1_dBet6_1h_2.csv")
df3 <- read.csv("HAP1_dBet6_6h_1.csv")
HAP1 <- join(df1, df2, by="ensembl_gene_name")
Error: cannot allocate vector of size 1.5 Gb
The failed vector allocation highlights two points:
1) R gives very confusing errors. If I open another terminal window and check what R is actually using with the top command, I can see that I am exceeding the default 8 GB I had access to! Therefore, I used salloc to request 80 GB and tried the code below.
2) The memory requirements swell as I run the plyr join() command iteratively.
HAP2 <- join(HAP1, df3, by="ensembl_gene_name")
Error: cannot allocate vector of size 34282.1 Gb
Theoretically, even if I had access to terabytes of memory, I think this vector would keep growing with each additional join until the problem became intractable. For example,
HAP3 <- join(HAP2, df4, by="ensembl_gene_name")
HAP4 <- join(HAP3, df5, by="ensembl_gene_name")
etc.
Can anyone suggest what I should try next?
I noticed there are a lot of genes without Ensembl IDs (~20,000 NAs) in each data.frame. These are essentially meaningless, so removing them prior to joining may help. Another thing I thought of was calling the garbage collection function gc() between plyr joins. However, I'm skeptical that this will help.
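For what it's worth, here is a minimal sketch of that NA clean-up, assuming the key column is ensembl_gene_name as in the code above:
# drop rows with a missing gene name before joining;
# NA keys match each other in merges and multiply the row count
df1 <- df1[!is.na(df1$ensembl_gene_name), ]
df2 <- df2[!is.na(df2$ensembl_gene_name), ]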
Since this problem was solved by friendly users and I only put minimal code in the original question, I thought I would write up my workflow to help others in the future.
library(plyr)
library(dplyr)
library(data.table)
# Load CSV files
df1 <- read.csv("HAP1_dBet6_1h_1.csv")
df2 <- read.csv("HAP1_dBet6_1h_2.csv")
df3 <- read.csv("HAP1_dBet6_6h_1.csv")
df4 <- read.csv("HAP1_dBet6_6h_2.csv")
...
df28 <- read.csv("HAP1_DMSO_6h_2.csv")
# Select only gene column and FPKM columns
df1 <- select(df1, ensembl_gene_name, FPKM)
...
# Rename the FPKM column to something meaningful (i.e. the sample name) and omit NAs
df1 <- plyr::rename(df1, c("FPKM"="HAP1_dBet6_1h_1"))
df1 <- na.omit(df1)
...
# Collapse duplicate gene names by summing their FPKM values
df1 <- df1 %>% group_by(ensembl_gene_name) %>% summarise(HAP1_dBet6_1h_1 = sum(HAP1_dBet6_1h_1))
...
df1 <- data.table(df1) # Turn data.frame into a data.table
setkey(df1, "ensembl_gene_name") # setkey() sorts a data.table and marks it as sorted
...
# Join data.tables
HAP1 <- merge(df1, df2, "ensembl_gene_name")
HAP2 <- merge(HAP1, df3, "ensembl_gene_name")
HAP3 <- merge(HAP2, df4, "ensembl_gene_name")
...
HAP27 <- merge(HAP26, df28, "ensembl_gene_name")
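As a side note, the chain of pairwise merges above can also be written with Reduce(); a sketch, assuming all of the cleaned, keyed tables have been collected into a list:
# merge a list of keyed data.tables in one pass
dt_list <- list(df1, df2, df3)  # ... extend with the remaining tables
HAP <- Reduce(function(x, y) merge(x, y, by = "ensembl_gene_name"), dt_list)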
Thanks!
Upvotes: 1
Views: 127
Reputation: 635
I think the main issue/difficulty with the data is that both/all data frames contain duplicate key values (which data.table then also complains about, saying it needs a cartesian product). So let's say df1 is something like
A 1
B 2
C 3
B 3
A 4
and df2 is
A 10
B 20
C 30
B 30
A 50
then, for each duplicated key, the merge matches every row on the left with every row on the right:
A 1 10
A 1 50
A 4 10
A 4 50
# etc.
This cartesian blow-up is what data.table complains about (and note that the by = .EACHI it suggests does not work here because nrow(df1) != nrow(df2)).
So you can do the maths on how quickly this explodes with 66k rows, of which (I checked the three examples) approximately half are duplicates. You have to decide how to handle the duplicate keys. If these duplicated rows in your end data frame are OK, then you can add allow.cartesian = TRUE to the data.table::merge() call, as suggested in my comment on PoGibas' answer.
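That is, something along these lines (a sketch reusing the column name from the question):
HAP1 <- merge(df1, df2, by = "ensembl_gene_name", allow.cartesian = TRUE)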
If the duplicates do worry you, then I think you could try to find the largest set of gene names that exists across all 22 files and then manually loop over them, constructing your final data frame that way. That will lose you data, though.
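A rough sketch of that idea (the file-listing pattern is an assumption):
files <- list.files(pattern = "^HAP1_.*\\.csv$")
# gene names present in every file
gene_sets <- lapply(files, function(f) unique(na.omit(read.csv(f)$ensembl_gene_name)))
common_genes <- Reduce(intersect, gene_sets)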
Perhaps you could first concatenate all files into one long data frame (22 x 66k rows) with an additional column labelling the sample, and then use tidyr::spread() to spread it back out to wide format (possibly grouping by gene and doing the spread() in a do() loop for memory reasons)?
# pseudocode sketch
library(dplyr)
library(tidyr)
myfiles <- data.frame(f = list.files(...), stringsAsFactors = FALSE)
long <- myfiles %>%
  group_by(f) %>%
  do({
    d <- read.csv(.$f, ...)
    d$sample <- .$f  # label each row with its source file
    d                # do() uses the block's last value, so return the data frame
  }) %>%
  ungroup()
# collapse duplicate gene names first, otherwise spread() errors on duplicate keys
hap <- long %>%
  group_by(ensembl_gene_name, sample) %>%
  summarise(FPKM = sum(FPKM)) %>%
  ungroup() %>%
  spread(sample, FPKM)
There's probably a data.table way of doing this that would be faster.
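One possibility (a sketch, assuming the long table built above with columns ensembl_gene_name, sample, and FPKM):
library(data.table)
# fun.aggregate collapses duplicate (gene, sample) pairs by summing
HAP <- dcast(as.data.table(long), ensembl_gene_name ~ sample,
             value.var = "FPKM", fun.aggregate = sum)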
Upvotes: 0
Reputation: 28369
Can you test this data.table solution:
library(data.table)
df1 <- fread("HAP1_dBet6_1h_1.csv")
df2 <- fread("HAP1_dBet6_1h_2.csv")
setkey(df1, "ensembl_gene_name")
setkey(df2, "ensembl_gene_name")
HAP1 <- merge(df1, df2, "ensembl_gene_name")
When dealing with larger data sets I prefer data.table objects keyed with setkey().
Edit:
If fread is converting your gene names to something else, then you can use this:
library(data.table)
df1 <- read.csv("HAP1_dBet6_1h_1.csv")
df2 <- read.csv("HAP1_dBet6_1h_2.csv")
setDT(df1)
setDT(df2)
setkey(df1, "ensembl_gene_name")
setkey(df2, "ensembl_gene_name")
HAP1 <- merge(df1, df2, "ensembl_gene_name")
Upvotes: 1
Reputation: 13581
You should definitely consider removing NAs from your data.frame. Each NA key on the left is likely being matched against every NA key in the right-hand data.frame, multiplying the row count. Look at the output of nrow(HAP1) or str(HAP1): does it contain many more rows than either df1 or df2? A 60,000 x 22 data frame is not that large, but if your data frame expands after each join, it will grow exponentially.
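A tiny made-up example of the effect (base merge matches NA keys against each other by default):
a <- data.frame(id = c("g1", NA, NA), x = 1:3)
b <- data.frame(id = c("g1", NA, NA), y = 4:6)
nrow(merge(a, b, by = "id"))  # 5 rows: 1 for "g1" plus 2 * 2 for the NA keys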
Upvotes: 1
Reputation: 31
I cannot try things on your data, but I do have a suggestion: convert to matrices and use rbind.fill, possibly transposing the matrices before and after.
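A sketch of how that might look (to_row is a hypothetical helper; plyr's matrix variant is rbind.fill.matrix, and this assumes gene names are unique within each sample):
library(plyr)
# one 1-row matrix per sample, with gene names as column names
to_row <- function(d) matrix(d$FPKM, nrow = 1, dimnames = list(NULL, d$ensembl_gene_name))
wide <- do.call(rbind.fill.matrix, lapply(list(df1, df2, df3), to_row))
result <- t(wide)  # transpose back: rows = genes, columns = samples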
Good luck!
Upvotes: 0