robertdj
robertdj

Reputation: 1117

Evaluate dnorm for multiple parameter values and the same argument

I am trying to accomplish the same as in this post, namely overlaying multiple histrograms with densities. The solution in the referred post works, but it made me wonder if calculation of the dfn can be done with newer packages like purrr/purrrlyr:

set.seed(1)
df <- data.frame(bsa=rnorm(200, mean=rep(c(1,4),each=100)), 
                 group=rep(c("test","control"),each=100))

stats <- df %>% group_by(group) %>% summarise(m = mean(bsa), sd = sd(bsa))
x <- with(df, seq(min(bsa), max(bsa), len=100))

dfn <- do.call(rbind,lapply(1:nrow(stats), 
                            function(i) with(stats[i,],data.frame(group, x, y=dnorm(x,mean=m,sd=sd)))))

To perform the inner lapply part, I have been trying stuff along the lines of

stats %>%
    dplyr::group_by(group) %>%
    purrr::map(x, dnorm, m, sd)

That is, passing on m and sd from stats and using the same x. Unfortunately, it doesn't work. (Once the inner part is accomplished, I can pass on the result to do.call, so that is not a problem).

Upvotes: 0

Views: 534

Answers (2)

Brian
Brian

Reputation: 8275

Per @Aurele's request, here's my take:

library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(bsa=rnorm(200, mean=rep(c(1,4),each=100)), 
           group=rep(c("test","control"),each=100)) 

Option 1:

df %>% 
  group_by(group) %>% 
  summarise_all(funs(mean, sd, min, max)) %>% 
  group_by(group) %>%                                                        
  mutate(newdata = list(data_frame(x = seq(min, max, length.out = 80)))) %>%     
  unnest() %>%                                                             
  mutate(dens = dnorm(x, mean, sd)) %>% 
  ggplot() + 
  geom_histogram(data = df, aes(bsa, y = ..density.., fill = group), alpha = 0.5) +
  geom_line(aes(x, dens, color = group), size = 2)

Option 2:

df %>% 
  group_by(group) %>% 
  summarise_all(funs(mean, sd, min, max)) %>% 
  group_by(group, mean, sd, min, max) %>% 
  do(data_frame(x = seq(.$min, .$max, length.out = 80))) %>% 
  mutate(dens = dnorm(x, mean, sd)) %>% 
  ggplot() + 
  geom_histogram(data = df, aes(bsa, y = ..density.., fill = group), alpha = 0.5) +
  geom_line(aes(x, dens, color = group), size = 2)

My two methods are the same, just slightly different in generating the new data.

  1. Group by group
  2. Generate summary statistics
  3. Group by group again
  4. Create a new list column named newdata with a sequence of x values from min to max of bsa, then tidyr::unnest to expand it

OR

  1. Group by all the stats to keep them in the environment
  2. Generate a whole new dataframe with do for the new x values, each of which inherits the stats columns from the environment

THEN

  1. Generate the densities at each x with the appropriate mean and sd.
  2. Make a plot of the densities as lines
  3. Add a layer with a histogram of the original data (with y = ..density.. for scaling)

The only difference between my approach and Aurele's is that they generate a new x value for each row of your original data. If you have 50-100 data points, that's a good idea. If you have <20 data points, your density lines will be jumpy and not smooth. If you have >500 data points, you're wasting your time with overly high resolution that isn't needed and takes up memory. The default in ggplot2 for curve generation is usually 80 points, so that's what I used (length.out = 80 in both options).

enter image description here

Upvotes: 1

Aur&#232;le
Aur&#232;le

Reputation: 12819

If you go dplyr, I think you don't really need to compute stats nor x separately. I'd do:

dfn_2 <-
  df %>% 
  mutate_at(vars(bsa), funs(min, max)) %>% 
  arrange(group) %>% 
  group_by(group) %>% 
  transmute(
    x = seq(first(min), first(max), length.out = n()), 
    y = dnorm(x, mean(bsa), sd(bsa))
  ) %>% 
  as.data.frame()

all.equal(dfn, dfn_2)
# [1] TRUE

Alternatively, here are two approaches that I do not recommend. Just to demonstrate it is possible, and how you could have done what you were trying:

dfn_3 <-
  stats %>% 
  split(.$group) %>% 
  map2_df(names(.), ~ tibble(group = .y, x, y = dnorm(x, .x$m, .x$sd)))

# # A tibble: 200 x 3
#      group         x            y
#      <chr>     <dbl>        <dbl>
#  1 control -1.888921 6.490182e-09
#  2 control -1.809524 1.045097e-08
#  3 control -1.730128 1.672139e-08
#  4 control -1.650731 2.658301e-08
#  5 control -1.571334 4.199062e-08
#  6 control -1.491938 6.590471e-08
#  7 control -1.412541 1.027772e-07
#  8 control -1.333145 1.592550e-07
#  9 control -1.253748 2.451917e-07
# 10 control -1.174352 3.750891e-07
# # ... with 190 more rows

all.equal(dfn, as.data.frame(mutate_at(dfn_3, vars(group), as.factor)))
# [1] TRUE


dfn_4 <-
  stats %>% 
  group_by(group) %>% 
  transmute(x = list(x), y = map(x, dnorm, m, sd)) %>% 
  ungroup() %>% 
  tidyr::unnest()

# # A tibble: 200 x 3
#      group         x            y
#     <fctr>     <dbl>        <dbl>
#  1 control -1.888921 6.490182e-09
#  2 control -1.809524 1.045097e-08
#  3 control -1.730128 1.672139e-08
#  4 control -1.650731 2.658301e-08
#  5 control -1.571334 4.199062e-08
#  6 control -1.491938 6.590471e-08
#  7 control -1.412541 1.027772e-07
#  8 control -1.333145 1.592550e-07
#  9 control -1.253748 2.451917e-07
# 10 control -1.174352 3.750891e-07
# # ... with 190 more rows

all.equal(dfn, as.data.frame(dfn_4))
# [1] TRUE

Upvotes: 2

Related Questions