Reputation: 795
I would like to fit a distribution to multiple subsets of a large dataframe. The subsets would be based on each year, and the distribution would be fit to freq.
An example dataframe:
df<- data.frame(year=c(rep(1998, 15), rep(1999, 16)),freq=c(103, 115, 13, 2, 67, 36, 51, 8, 6, 61, 10, 21,
7, 65, 4, 49, 92, 37, 16, 6, 23, 9, 2, 6, 5, 4,1, 3, 1, 9, 2))
I have tried the following to get an output of the coefficients (alpha parameter) of the fitted distribution along with associated statistics.
library(sads)
coef_vec<- NA
for (i in 1: length(unique(df$year))){
fit<- fitsad(df$freq[i], sad="ls")
coef_vec[i,] <- as.vector(t(do.call(rbind, coef(summary(coeff)))
[,1:2]))
}
I wish for the output to look like below:
output <- data.frame(para = rep(c("Estimate", "Std. Error", "z value", "Pr(z)"), 2),
                     year = c(rep(1998, 4), rep(1999, 4)),
                     value = c(3.7439, 2.2216, 1.6852, 0.09195, 2.8246, 1.8690, 1.5113, 0.1307))
You will notice that the alpha parameter and statistics are reported for each year. I modified this code from another I had found, but it isn't working.
Upvotes: 0
Views: 136
Reputation: 13108
We'll use the split-apply-combine strategy to deal with this problem.
First, we split the data into subsets:
library(sads) # Be sure to specify what package you're using in your question
by_year <- split(df$freq, df$year)
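To see what the split step produces, here is a small check using the example data from the question. It uses base R only, so nothing in it depends on sads; the printed values are just the group names and sizes.

```r
# Example data copied from the question
df <- data.frame(
  year = c(rep(1998, 15), rep(1999, 16)),
  freq = c(103, 115, 13, 2, 67, 36, 51, 8, 6, 61, 10, 21,
           7, 65, 4, 49, 92, 37, 16, 6, 23, 9, 2, 6, 5, 4, 1, 3, 1, 9, 2)
)

# split() returns a named list with one numeric vector per year
by_year <- split(df$freq, df$year)

names(by_year)    # the years, stored as character strings
lengths(by_year)  # number of observations in each year
```

Because the list is named by year, the year label can be recovered later with names(by_year)[i].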
Then, we iterate through the subsets, applying a function to each subset that creates a dataframe with your desired output. (Here we're actually iterating through the index of each subset, i.e. 1, 2, ..., n, because this allows us to get the name of each subset, in this case the year).
out <- lapply(seq_along(by_year), function(i) {
  fitted <- fitsad(by_year[[i]], sad = "ls")
  coefs <- coef(summary(fitted))
  df <- data.frame(param = colnames(coefs),
                   year = names(by_year)[i],
                   value = as.vector(coefs))
  df
})
Finally, we combine the output into one dataframe:
data.frame(do.call(rbind, out), row.names = NULL)
#        param year      value
# 1   Estimate 1998 2.82461397
# 2 Std. Error 1998 1.86900479
# 3    z value 1998 1.51129307
# 4      Pr(z) 1998 0.13071380
# 5   Estimate 1999 3.74388575
# 6 Std. Error 1999 2.22161670
# 7    z value 1999 1.68520778
# 8      Pr(z) 1999 0.09194849
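If some years might fail to fit, the same split-apply-combine pattern can be wrapped in a small helper that catches errors per group. The sketch below is only an illustration: fit_by_group and toy_fit are hypothetical names, and toy_fit is a stand-in (mean and standard deviation) so that the example runs without sads.

```r
# Hypothetical helper: apply `fit_fun` to each group and stack the results.
# `fit_fun` is assumed to return a named numeric vector.
# If it throws an error, that group gets a single NA row instead of aborting.
fit_by_group <- function(values, groups, fit_fun) {
  pieces <- split(values, groups)                       # split
  out <- lapply(names(pieces), function(g) {            # apply
    est <- tryCatch(fit_fun(pieces[[g]]),
                    error = function(e) c(failed = NA_real_))
    data.frame(param = names(est), group = g, value = unname(est),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)                                   # combine
}

# Stand-in for a real model fit, used only to keep the sketch self-contained
toy_fit <- function(x) c(mean = mean(x), sd = sd(x))

res <- fit_by_group(c(1, 2, 3, 10, 20), c("a", "a", "a", "b", "b"), toy_fit)
res
```

With sads loaded, passing something like function(x) coef(summary(fitsad(x, sad = "ls")))[1, ] as fit_fun should give the same long format as above, assuming the coefficient table has a single row as in the output shown.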
The tidyverse approach to split-apply-combine:
library(dplyr)
library(tidyr)
library(purrr)
fit <- function(x) {
  # x is the nested data frame for one year; fit the log-series to its freq column
  values <- coef(summary(fitsad(x$freq, sad = "ls")))
  data.frame(param = colnames(values), value = as.vector(values))
}
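df %>%
  group_by(year) %>%
  nest() %>%                            # nests all non-grouping columns (here freq) into `data`
  mutate(values = map(data, fit)) %>%
  select(year, values) %>%
  unnest(values)                        # name the column explicitly; newer tidyr warns otherwise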
# # A tibble: 8 x 3
#    year param      value
#   <dbl> <fct>      <dbl>
# 1  1998 Estimate   2.82
# 2  1998 Std. Error 1.87
# 3  1998 z value    1.51
# 4  1998 Pr(z)      0.131
# 5  1999 Estimate   3.74
# 6  1999 Std. Error 2.22
# 7  1999 z value    1.69
# 8  1999 Pr(z)      0.0919
Upvotes: 2