Danielle

Reputation: 795

Fit distribution to multiple subsets and extract parameter

I would like to fit a distribution to multiple subsets of a large dataframe. The subsets would be based on each year and the distribution would be fit to freq.

An example dataframe:

df<- data.frame(year=c(rep(1998, 15), rep(1999, 16)),freq=c(103, 115, 13, 2, 67, 36, 51, 8, 6, 61, 10, 21,
      7, 65, 4, 49, 92, 37, 16, 6, 23, 9, 2, 6, 5, 4,1, 3, 1, 9, 2))

I have tried the following to get an output of the coefficients (alpha parameter) of the fitted distribution along with associated statistics.

library(sads)

coef_vec<- NA

for (i in 1: length(unique(df$year))){ 
  fit<- fitsad(df$freq[i], sad="ls")
coef_vec[i,] <- as.vector(t(do.call(rbind, coef(summary(coeff))) 
[,1:2]))
} 

I wish for the output to look like below:

output <- data.frame(para = rep(c("Estimate", "Std. Error", "z value", "Pr(z)"), 2),
                     year = c(rep(1998, 4), rep(1999, 4)),
                     value = c(3.7439, 2.2216, 1.6852, 0.09195, 2.8246, 1.8690, 1.5113, 0.1307))

Note that the alpha parameter and its statistics are reported for each year. I adapted this code from another example I had found, but it isn't working.

Upvotes: 0

Views: 136

Answers (1)

Weihuang Wong

Reputation: 13108

We'll use the split-apply-combine strategy to deal with this problem.

First, we split the data into subsets:

library(sads) # Be sure to specify what package you're using in your question
by_year <- split(df$freq, df$year)
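Before fitting anything, it can help to check what the split step produced. The following is a minimal sketch in base R that rebuilds the example dataframe from the question and inspects the result; it does not need `sads`.

```r
# Example data copied from the question.
df <- data.frame(
  year = c(rep(1998, 15), rep(1999, 16)),
  freq = c(103, 115, 13, 2, 67, 36, 51, 8, 6, 61, 10, 21,
           7, 65, 4, 49, 92, 37, 16, 6, 23, 9, 2, 6, 5, 4, 1, 3, 1, 9, 2)
)

# split() returns a named list with one numeric vector per year.
by_year <- split(df$freq, df$year)

subset_names <- names(by_year)    # the years, stored as character strings
subset_sizes <- lengths(by_year)  # number of observations in each subset
```

The list names are what later let us label each fitted result with its year.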

Then, we iterate through the subsets, applying a function to each subset that creates a dataframe with your desired output. (Here we're actually iterating through the index of each subset, i.e. 1, 2, ..., n, because this allows us to get the name of each subset, in this case the year).

out <- lapply(seq_along(by_year), function(i) {
  fitted <- fitsad(by_year[[i]], sad = "ls")
  coefs <- coef(summary(fitted))
  df <- data.frame(param = colnames(coefs),
                   year = names(by_year)[i],
                   value = as.vector(coefs))
  df
})

Finally, we combine the output into one dataframe:

data.frame(do.call(rbind, out), row.names = NULL)
#        param year      value
# 1   Estimate 1998 2.82461397
# 2 Std. Error 1998 1.86900479
# 3    z value 1998 1.51129307
# 4      Pr(z) 1998 0.13071380
# 5   Estimate 1999 3.74388575
# 6 Std. Error 1999 2.22161670
# 7    z value 1999 1.68520778
# 8      Pr(z) 1999 0.09194849
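As a variation on the indexing choice described above, one could also iterate over the subset names directly rather than over positions. The sketch below shows only the pattern: the hypothetical helper `summarise_subset` uses a stand-in summary (mean and count) in place of `fitsad`, so it runs without `sads`. It is an illustration under those assumptions, not a replacement for the fitting code.

```r
# Example data copied from the question.
df <- data.frame(
  year = c(rep(1998, 15), rep(1999, 16)),
  freq = c(103, 115, 13, 2, 67, 36, 51, 8, 6, 61, 10, 21,
           7, 65, 4, 49, 92, 37, 16, 6, 23, 9, 2, 6, 5, 4, 1, 3, 1, 9, 2)
)
by_year <- split(df$freq, df$year)

# Hypothetical stand-in for the model fit: returns one small data frame
# per subset, labelled with the subset's name.
summarise_subset <- function(nm) {
  x <- by_year[[nm]]
  data.frame(year = nm,
             stat = c("mean", "n"),
             value = c(mean(x), length(x)))
}

# Apply over the names, then combine the pieces into one data frame.
pieces <- lapply(names(by_year), summarise_subset)
result <- do.call(rbind, pieces)
```

Swapping the body of `summarise_subset` for the `fitsad` call and `coef(summary(...))` extraction would give the same long-format layout as the output above.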

The tidyverse approach to split-apply-combine:

library(dplyr)
library(tidyr)
library(purrr)

fit <- function(x) {
  values <- coef(summary(fitsad(x$freq, sad = "ls")))
  data.frame(param = colnames(values), value = as.vector(values))
}

df %>%
  group_by(year) %>%
  nest() %>%                      # older tidyr versions also accepted nest(freq)
  mutate(values = map(data, fit)) %>%
  select(year, values) %>%
  unnest(values)                  # tidyr >= 1.0 expects the column to be named
# # A tibble: 8 x 3
#    year param       value
#   <dbl> <fct>       <dbl>
# 1  1998 Estimate   2.82  
# 2  1998 Std. Error 1.87  
# 3  1998 z value    1.51  
# 4  1998 Pr(z)      0.131 
# 5  1999 Estimate   3.74  
# 6  1999 Std. Error 2.22  
# 7  1999 z value    1.69  
# 8  1999 Pr(z)      0.0919

Upvotes: 2
