Reputation: 1117
I am trying to accomplish the same as in this post, namely overlaying multiple histrograms with densities.
The solution in the referred post works, but it made me wonder if calculation of the dfn
can be done with newer packages like purrr
/purrrlyr
:
set.seed(1)
df <- data.frame(bsa=rnorm(200, mean=rep(c(1,4),each=100)),
group=rep(c("test","control"),each=100))
stats <- df %>% group_by(group) %>% summarise(m = mean(bsa), sd = sd(bsa))
x <- with(df, seq(min(bsa), max(bsa), len=100))
dfn <- do.call(rbind,lapply(1:nrow(stats),
function(i) with(stats[i,],data.frame(group, x, y=dnorm(x,mean=m,sd=sd)))))
To perform the inner lapply
part, I have been trying stuff along the lines of
stats %>%
dplyr::group_by(group) %>%
purrr::map(x, dnorm, m, sd)
That is, passing on m
and sd
from stats and using the same x
. Unfortunately, it doesn't work.
(Once the inner part is accomplished, I can pass on the result to do.call
, so that is not a problem).
Upvotes: 0
Views: 534
Reputation: 8275
Per @Aurele's request, here's my take:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(bsa=rnorm(200, mean=rep(c(1,4),each=100)),
group=rep(c("test","control"),each=100))
df %>%
group_by(group) %>%
summarise_all(funs(mean, sd, min, max)) %>%
group_by(group) %>%
mutate(newdata = list(data_frame(x = seq(min, max, length.out = 80)))) %>%
unnest() %>%
mutate(dens = dnorm(x, mean, sd)) %>%
ggplot() +
geom_histogram(data = df, aes(bsa, y = ..density.., fill = group), alpha = 0.5) +
geom_line(aes(x, dens, color = group), size = 2)
df %>%
group_by(group) %>%
summarise_all(funs(mean, sd, min, max)) %>%
group_by(group, mean, sd, min, max) %>%
do(data_frame(x = seq(.$min, .$max, length.out = 80))) %>%
mutate(dens = dnorm(x, mean, sd)) %>%
ggplot() +
geom_histogram(data = df, aes(bsa, y = ..density.., fill = group), alpha = 0.5) +
geom_line(aes(x, dens, color = group), size = 2)
My two methods are the same, just slightly different in generating the new data.
list
column named newdata
with a sequence of x values from min to max of bsa, then tidyr::unnest
to expand itdo
for the new x values, each of which inherits the stats columns from the environmenty = ..density..
for scaling)The only difference between my approach and Aurele's is that they generate a new
x value for each row of your original data. If you have 50-100 data points, that's a good idea. If you have <20 data points, your density lines will be jumpy and not smooth. If you have >500 data points, you're wasting your time with overly high resolution that isn't needed and takes up memory. The default in ggplot2
for curve generation is usually 80 points, so that's what I used (length.out = 80
in both options).
Upvotes: 1
Reputation: 12819
If you go dplyr
, I think you don't really need to compute stats
nor x
separately. I'd do:
dfn_2 <-
df %>%
mutate_at(vars(bsa), funs(min, max)) %>%
arrange(group) %>%
group_by(group) %>%
transmute(
x = seq(first(min), first(max), length.out = n()),
y = dnorm(x, mean(bsa), sd(bsa))
) %>%
as.data.frame()
all.equal(dfn, dfn_2)
# [1] TRUE
Alternatively, here are two approaches that I do not recommend. Just to demonstrate it is possible, and how you could have done what you were trying:
dfn_3 <-
stats %>%
split(.$group) %>%
map2_df(names(.), ~ tibble(group = .y, x, y = dnorm(x, .x$m, .x$sd)))
# # A tibble: 200 x 3
# group x y
# <chr> <dbl> <dbl>
# 1 control -1.888921 6.490182e-09
# 2 control -1.809524 1.045097e-08
# 3 control -1.730128 1.672139e-08
# 4 control -1.650731 2.658301e-08
# 5 control -1.571334 4.199062e-08
# 6 control -1.491938 6.590471e-08
# 7 control -1.412541 1.027772e-07
# 8 control -1.333145 1.592550e-07
# 9 control -1.253748 2.451917e-07
# 10 control -1.174352 3.750891e-07
# # ... with 190 more rows
all.equal(dfn, as.data.frame(mutate_at(dfn_3, vars(group), as.factor)))
# [1] TRUE
dfn_4 <-
stats %>%
group_by(group) %>%
transmute(x = list(x), y = map(x, dnorm, m, sd)) %>%
ungroup() %>%
tidyr::unnest()
# # A tibble: 200 x 3
# group x y
# <fctr> <dbl> <dbl>
# 1 control -1.888921 6.490182e-09
# 2 control -1.809524 1.045097e-08
# 3 control -1.730128 1.672139e-08
# 4 control -1.650731 2.658301e-08
# 5 control -1.571334 4.199062e-08
# 6 control -1.491938 6.590471e-08
# 7 control -1.412541 1.027772e-07
# 8 control -1.333145 1.592550e-07
# 9 control -1.253748 2.451917e-07
# 10 control -1.174352 3.750891e-07
# # ... with 190 more rows
all.equal(dfn, as.data.frame(dfn_4))
# [1] TRUE
Upvotes: 2