DeltaIV

Reputation: 5646

Create multiple histograms in a plot starting from bins and frequencies, rather than from samples?

I have a data frame of size 10^6x3, i.e., 1 million samples for each of three variables. I would like to create three overlaid histograms (alpha blending?) in the same plot using R. Handling that many samples on my PC is possible (they fit in memory and R doesn't hang forever), but it's not lightning fast.

The code that generated the samples also returns the lower and upper bin boundaries and the corresponding frequencies. Of course, this is much less data: I can choose the number of bins, but let's say 30 bins per variable, so 30x2x3=180 doubles. Is there a way in R to create overlaid histograms starting from bin and frequency data? I would like to use ggplot2, but I'm open to solutions with base R or other packages.

Also, what would you do in my situation? Would you use the original samples and not worry about the longer computation time and memory use? Or would you go for bins/frequencies? I'd like to use the raw data, but I'm worried that R could get too slow or hog too much memory, and that this could cause problems in subsequent computations. So a solution that uses the raw data but is optimized for speed/memory would be great; otherwise it's fine to use bins/frequencies (if at all possible!).

Upvotes: 0

Views: 1058

Answers (2)

jlhoward

Reputation: 59395

I was curious about "not lightning fast". The dataset below (1e6 cases X 3 variables) renders in ~6 sec on my machine (Core i7, Win7 x64). Is that too slow?

set.seed(1)    # for reproducible example
df <- data.frame(matrix(rnorm(3e6, mean=rep(c(0,3,6), each=1e6)), ncol=3))
names(df) <- c("A","B","C")

library(ggplot2)
library(reshape2)
gg.df <- melt(df, variable.name="category")

system.time({
  ggp <- ggplot(gg.df, aes(x=value, fill=category)) + 
    stat_bin(geom="bar", position="identity", alpha=0.7)
  plot(ggp)
})
#    user  system elapsed 
#    5.68    0.53    6.24 
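
If the rendering time above is still a concern, one option is to bin the raw samples once and plot only the counts. The following is a minimal sketch, not a benchmarked solution: it assumes the same toy data as above, a shared set of 30 equal-width breaks, and base R's hist() with plot=FALSE to do the counting. The names n_bins, breaks, and binned are made up for this illustration.

```r
library(ggplot2)

set.seed(1)  # same toy data as in the example above
df <- data.frame(matrix(rnorm(3e6, mean = rep(c(0, 3, 6), each = 1e6)), ncol = 3))
names(df) <- c("A", "B", "C")

# Shared breaks so the three histograms line up bin for bin.
# floor/ceiling pad the range so every sample falls inside the breaks.
n_bins <- 30  # assumed bin count, pick whatever suits your data
rng <- range(unlist(df, use.names = FALSE))
breaks <- seq(floor(rng[1]), ceiling(rng[2]), length.out = n_bins + 1)

# Count once per variable without drawing anything; keep only mids and counts
binned <- do.call(rbind, lapply(names(df), function(v) {
  h <- hist(df[[v]], breaks = breaks, plot = FALSE)
  data.frame(category = v, mid = h$mids, count = h$counts)
}))

# 90 rows instead of 3e6: plot the precomputed counts as overlaid bars
ggp <- ggplot(binned, aes(x = mid, y = count, fill = category)) +
  geom_col(position = "identity", alpha = 0.7, width = diff(breaks)[1])
print(ggp)
```

After this step the raw samples are no longer needed for plotting, so the plot can be restyled and redrawn from the small binned data frame.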

Upvotes: 1

Axeman

Reputation: 35397

Yes, of course you can! Using the bins and frequencies you can make a bar graph.

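library(ggplot2)

set.seed(1)  # for a reproducible example
# One row per (group, bin); in practice 'frequency' holds your precomputed counts
dat <- data.frame(group = rep(c('a', 'b'), each = 10),
                  bin = rep(1:10, 2),
                  frequency = rnorm(20, 5))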

Using alpha blending as you suggested:

ggplot(dat, aes(x = bin, y = frequency, fill = group)) + 
  geom_bar(stat = 'identity', position = position_identity(), alpha = 0.4)


Or we dodge the bars:

ggplot(dat, aes(x = bin, y = frequency, fill = group)) + 
  geom_bar(stat = 'identity', position = 'dodge')

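
Since the question mentions lower and upper bin boundaries rather than bin indices, the bars can also be drawn at their true positions and widths with geom_rect. This is only a sketch under assumptions: the column names (variable, lower, upper, frequency) and the helper make_bins are hypothetical, and the frequencies are toy values derived from normal distributions, not real output.

```r
library(ggplot2)

# Hypothetical layout: one row per (variable, bin) with the bin edges and the
# frequency, as the sampling code might return them. Values here are made up.
edges <- seq(-4, 10, length.out = 31)  # 31 edges give 30 bins
make_bins <- function(name, mu) {
  data.frame(variable = name,
             lower = head(edges, -1),
             upper = tail(edges, -1),
             frequency = round(1e6 * diff(pnorm(edges, mean = mu))))
}
bins <- rbind(make_bins("A", 0), make_bins("B", 3), make_bins("C", 6))

# Each bin becomes a rectangle from lower to upper, with height = frequency
ggp <- ggplot(bins, aes(xmin = lower, xmax = upper,
                        ymin = 0, ymax = frequency, fill = variable)) +
  geom_rect(alpha = 0.4) +
  labs(x = "value", y = "frequency")
print(ggp)
```

Because the rectangles use the actual edges, this should also cope with bins of unequal width, or with variables whose bin edges differ from one another.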

Upvotes: 1
