Assume I have a very large SQL database and I want to create a density plot for the values in a float/numeric column. I could import all the values into R and plot the density easily. However, because the table has so many rows, I want to avoid importing all the values and instead calculate summary statistics (e.g. min, q25, median, q75, max) on the SQL server with a query, importing only those few summary values, which means much less traffic.
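For example, on the database side I imagine a query roughly like this sketch (assuming a PostgreSQL backend and DBI; the connection con, table measurements and column val are placeholders):

library(DBI)
# rough sketch, assuming PostgreSQL; `con`, `measurements` and `val`
# are placeholders for the actual connection, table and column
agg_db = dbGetQuery(con, "
  SELECT min(val) AS min,
         percentile_cont(0.25) WITHIN GROUP (ORDER BY val) AS q25,
         percentile_cont(0.50) WITHIN GROUP (ORDER BY val) AS med,
         percentile_cont(0.75) WITHIN GROUP (ORDER BY val) AS q75,
         max(val) AS max
  FROM measurements")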
So, in R I would get data like the agg object below:
require(data.table)

# test data
set.seed(1234)
vec = rgamma(1000, shape = 0.75)
dt = data.table(group = 'A',
                val = vec)
agg = dt[ ,
          .(min = min(val),
            q25 = quantile(val, .25),
            med = median(val),
            q75 = quantile(val, .75),
            max = max(val)) ]
How can I construct a density or violin plot from such aggregated data using the {ggplot2} package? I assume geom_density() calculates such values under the hood anyway, but I can't feed them into the function directly.
I would like to obtain a plot like the following, but using only the agg object:
require(ggplot2)
ggplot(dt) +
  geom_density(aes(x = val))
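For contrast, the five values in agg are already enough to draw an exact boxplot via stat = "identity" (a minimal sketch below); it's the full density or violin shape that they can't reproduce:

# minimal sketch: the five summary values suffice for an exact boxplot
ggplot(agg) +
  geom_boxplot(aes(x = "A", ymin = min, lower = q25, middle = med,
                   upper = q75, ymax = max),
               stat = "identity")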
Side note: Another way would be to import only a sample of the values.
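For example, a rough sketch of that route, again assuming a PostgreSQL backend with the placeholder connection con and table measurements (TABLESAMPLE SYSTEM (1) reads roughly 1% of the rows):

library(DBI)
# rough sketch, assuming PostgreSQL 9.5+; `con` and `measurements`
# are placeholders as before
val_sample = dbGetQuery(con,
  "SELECT val FROM measurements TABLESAMPLE SYSTEM (1)")
ggplot(val_sample) + geom_density(aes(x = val))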
You could group and tally the data with sufficiently small bins on the database side, export the (much smaller) aggregated data and then reconstruct the density plot. "Sufficiently small" here means small enough that the difference between the actual distribution of the data and the estimated distribution from the aggregated data is small enough not to matter for whatever your goals are.
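On the database side, the tally could look roughly like the following sketch, assuming a PostgreSQL backend; the DBI connection con, the table measurements and the range passed to width_bucket() are placeholders:

library(DBI)
# hedged sketch, assuming PostgreSQL; `con` and `measurements` are
# placeholders, and 0.0/12.0 stand in for the column's min/max
# (e.g. fetched with a prior min/max query)
bins = dbGetQuery(con, "
  SELECT width_bucket(val, 0.0, 12.0, 10000) AS bin,
         count(*) AS n
  FROM measurements
  GROUP BY bin
  ORDER BY bin")

The bin midpoints can then be reconstructed in R from the bucket numbers and the known range.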
You may need to experiment a bit, as it depends on the scale over which the density varies. In the example below, I used 10,000 bins, which still gives a relatively small data set to export from the database. Note I've expanded your vec sample data to 1 million values:
require(tidyverse)
require(patchwork) # needed only for laying out the two plots side by side
require(data.table)

# test data
set.seed(1234)
vec = rgamma(1e6, shape = 0.75)
dt = data.table(group = 'A',
                val = vec)
agg = dt[ ,
          .(min = min(val),
            q25 = quantile(val, .25),
            med = median(val),
            q75 = quantile(val, .75),
            max = max(val)) ]

# Set breaks for aggregating
br = seq(min(vec), max(vec), length.out = 10000)
dx = median(diff(br))

# Aggregate data: count values per bin, labelling each bin by its right edge
agg2 = dt %>%
  count(val = cut(val, breaks = br, labels = (br + dx)[-length(br)],
                  include.lowest = TRUE)) %>%
  mutate(val = as.numeric(as.character(val)))

p = ggplot(data = dt, aes(val)) +
  geom_density(linewidth = 1, colour = "blue", n = 2^11) +
  geom_density(data = agg2 %>% uncount(weights = n),
               colour = "orange", n = 2^11) +
  theme_bw()

p + {p + coord_cartesian(xlim = c(0, 0.3))}
The left-hand plot compares the density plot of the actual data to the density reconstructed from the data aggregated into 10,000 bins. The right-hand plot zooms in on the x-axis, to just the region of the thin peak near zero. Note the excellent overlap.
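As a side note, re-expanding the counts with uncount() materialises a million rows again; a near-equivalent shortcut (the default bandwidth choice can differ slightly) is to hand the bin counts to geom_density() through the weight aesthetic instead:

# sketch: pass normalised bin counts as weights rather than expanding rows
ggplot(agg2, aes(x = val, weight = n / sum(n))) +
  geom_density(colour = "orange", n = 2^11) +
  theme_bw()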