Reputation: 5646
I have a data frame of size 10^6 x 3, i.e., 1 million samples for each of three variables. I would like to create three overlaid histograms in the same plot (with alpha blending?) using R. Handling that many samples on my PC is possible (they fit in memory and R doesn't hang), but it is not lightning fast. The code that generated the samples also returns the lower and upper bin boundaries and the corresponding frequencies. Of course, this is much less data: I can choose the number of bins, but let's say 30 bins per variable, so 30x2x3=180 doubles. Is there a way in R to create overlaid histograms starting from bin and frequency data? I would like to use ggplot2, but I'm open to solutions with base R or other packages.
Also, what would you do in my situation? Would you use the original samples and not care about the longer computation time and memory use? Or would you go for bins/frequencies? I'd like to use the raw data, but I'm worried that R could get too slow or use too much memory, and that this could cause issues in subsequent computations. So a solution that uses the raw data but is optimized for speed/memory would be great; otherwise it's fine to use bins/frequencies (if at all possible!).
Upvotes: 0
Views: 1058
Reputation: 59395
I was curious about "not lightning fast". The dataset below (1e6 cases X 3 variables) renders in ~6 sec on my machine (Core i7, Win7 x64). Is that too slow?
set.seed(1) # for reproducible example
df <- data.frame(matrix(rnorm(3e6, mean=rep(c(0,3,6), each=1e6)), ncol=3))
names(df) <- c("A","B","C")
library(ggplot2)
library(reshape2)
gg.df <- melt(df, variable.name="category")
system.time({
  ggp <- ggplot(gg.df, aes(x=value, fill=category)) +
    stat_bin(geom="bar", position="identity", alpha=0.7)
  plot(ggp)
})
# user system elapsed
# 5.68 0.53 6.24
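If that is still too slow for repeated plotting, one option is to bin the raw samples once with base R and keep only the bin table. The following is a minimal sketch, not a benchmark: it rebuilds the same `df` so it runs on its own, and the name `binned`, its column names, and the choice of roughly 30 bins are assumptions for illustration. Note that `breaks = 30` is only a suggestion to `hist()`, which picks "pretty" break points, so the exact bin count can differ per variable.

```r
set.seed(1)  # same simulated data as above
df <- data.frame(matrix(rnorm(3e6, mean = rep(c(0, 3, 6), each = 1e6)), ncol = 3))
names(df) <- c("A", "B", "C")

# Bin each column once; plot = FALSE returns breaks and counts without drawing
binned <- do.call(rbind, lapply(names(df), function(v) {
  h <- hist(df[[v]], breaks = 30, plot = FALSE)
  data.frame(category = v,
             lower    = head(h$breaks, -1),  # left edge of each bin
             upper    = tail(h$breaks, -1),  # right edge of each bin
             count    = h$counts)
}))

nrow(binned)                          # one row per bin per variable
print(object.size(binned), units = "Kb")
```

The resulting table has a few dozen rows instead of 3 million, so any later plotting or computation on it should be cheap; the raw `df` can be dropped with `rm(df)` if it is no longer needed.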
Upvotes: 1
Reputation: 35397
Yes, of course you can! Using the bins and frequencies you can make a bar graph.
dat <- data.frame(group = rep(c('a', 'b'), each = 10),
                  bin = rep(1:10, 2),
                  frequency = rnorm(20, 5))
library(ggplot2)
Using alpha blending as you suggested:
ggplot(dat, aes(x = bin, y = frequency, fill = group)) +
  geom_bar(stat = 'identity', position = position_identity(), alpha = 0.4)
Or we dodge the bars:
ggplot(dat, aes(x = bin, y = frequency, fill = group)) +
  geom_bar(stat = 'identity', position = 'dodge')
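Since the question mentions having lower and upper bin boundaries rather than a bin index, the bars can also be drawn at their true positions with `geom_rect()`. This is a sketch under assumptions: the data frame `bins`, its column names (`lower`, `upper`, `freq`), and the synthetic frequencies are made up for illustration and stand in for whatever your sampling code returns.

```r
library(ggplot2)

# Hypothetical bin table: one row per bin and group, with the bin edges
# and the frequency in that bin (synthetic values, for illustration only)
edges <- seq(-3, 3, by = 0.5)
bins <- data.frame(
  group = rep(c("a", "b"), each = length(edges) - 1),
  lower = rep(head(edges, -1), 2),
  upper = rep(tail(edges, -1), 2),
  freq  = c(diff(pnorm(edges, mean = -0.5)),
            diff(pnorm(edges, mean =  0.5))) * 1e6
)

# Each bin becomes a rectangle from its lower to its upper edge;
# alpha lets the overlapping groups show through each other
p <- ggplot(bins, aes(xmin = lower, xmax = upper,
                      ymin = 0, ymax = freq, fill = group)) +
  geom_rect(alpha = 0.4) +
  labs(x = "value", y = "frequency")
print(p)
```

Because the rectangles use the actual edges, this also works when the groups have different or unequal bin widths, which a bar chart on a shared bin index would not show correctly.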
Upvotes: 1