Reputation: 37
For a project I have received a large amount of confidential patient level data that I need to fit a distribution to so as to use it in a simulation model. I am using R.
The problem is that I need is to fit the distribution to get the shape/rate data for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how that variable is distributed) but I want to be able to set up a function or loop for each variable and generate the shape and rate data for each subset I define.
An example of this: I need to find length-of-stay data for subsets of patients, and there are 48 subsets of patients. The way I have currently been doing this is by manually filtering the data, extracting the result to a vector, and then fitting the distribution to that vector using fitdist.
i.e. For a variable that is gamma distributed:
vector1 <- los_data %>%
  filter(group == 1, setting == 1, diagnosis == 1)
fitdist(vector1, "gamma")
I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I'm assuming something to do with a matrix, but I am absolutely clueless about how best to proceed.
Upvotes: 2
Views: 775
Reputation: 523
One common practice is to split the data using split and then apply the function of interest to each group. Let's assume here we have four columns: group, setting, diagnosis, and stay.length. The first three each have two levels.
df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting = sample(1:2, 64, TRUE),
  diagnosis = sample(1:2, 64, TRUE),
  stay.length = sample(1:5, 64, TRUE)
)
> head(df)
  group setting diagnosis stay.length
1     1       1         1           4
2     1       1         2           5
3     1       1         2           4
4     2       1         2           3
5     1       2         2           3
6     1       1         2           5
Perform the split and you will get a named list, one element per combination of the three grouping variables:
dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))
> head(dfl)
$`1.1.1`
[1] 5 3 4 1 4 5 4 2 1
$`2.1.1`
[1] 5 4 5 4 3 1 5 3 1
$`1.2.1`
[1] 4 2 5 4 5 3 5 3
$`2.2.1`
[1] 2 1 4 3 5 4 4
$`1.1.2`
[1] 5 4 4 4 3 2 4 4 5 1 5 5
$`2.1.2`
[1] 5 4 4 5 3 2 4 5 1 2
Afterwards, we can use lapply to apply whatever function we like to each group in the list. For example, we can apply mean:
dflm <- lapply(dfl, mean)
> dflm
$`1.1.1`
[1] 3.222222
...
$`2.2.2`
[1] 2.8
In your case, you can apply fitdist (from the fitdistrplus package) or any other function:
library(fitdistrplus)
dfl.fitdist <- lapply(dfl, function(x) fitdist(x, "gamma"))
> dfl.fitdist
$`1.1.1`
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
shape 3.38170 2.2831073
rate 1.04056 0.7573495
...
$`2.2.2`
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
estimate Std. Error
shape 4.868843 2.5184018
rate 1.549188 0.8441106
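If the end goal is a single table of shape/rate values rather than a list of fit objects, the list can be collapsed with sapply. Below is a self-contained sketch in base R; it uses a method-of-moments estimator as a stand-in for fitdist so it runs without extra packages (with fitdistrplus installed you would instead extract f$estimate from each fit):

```r
set.seed(1)
df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting = sample(1:2, 64, TRUE),
  diagnosis = sample(1:2, 64, TRUE),
  stay.length = rgamma(64, shape = 3, rate = 1)
)

# one vector of stay lengths per subset
dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))

# method-of-moments gamma estimates, standing in for fitdist(x, "gamma")
mom_gamma <- function(x) {
  m <- mean(x)
  v <- var(x)
  c(shape = m^2 / v, rate = m / v)
}

# one row per subset, columns shape and rate
param_table <- t(sapply(dfl, mom_gamma))
head(param_table)
```

The row names of param_table carry the subset labels (e.g. "1.1.1"), so the result drops straight into a simulation model as a lookup table.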
Upvotes: 0
Reputation: 1955
OK, your example isn't quite reproducible here, but I think the answer you want will be something like the following:
library(dplyr)
library(fitdistrplus)

result <- los_data %>%
  group_by(group, setting, diagnosis) %>%
  do({
    fit <- fitdist(.$my_column, "gamma")
    tibble(fit = list(fit))
  }) %>%
  ungroup()
This will give you a data frame of all the fits, with columns for group, setting, and diagnosis, as well as a list-column containing the fit for each subset. Since it is a list-column, you will need to use double brackets to extract individual fits. Example:
# Get the fit in the first row
result$fit[[1]]
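If list-columns are new to you, here is a minimal sketch of how extraction works, using mock parameter vectors as stand-ins for the real fitdist objects (whose parameters live in fit$estimate):

```r
# mock list-column: named vectors standing in for fitdist objects
fits <- list(c(shape = 3.38, rate = 1.04),
             c(shape = 4.87, rate = 1.55))
result <- data.frame(group = 1:2)
result$fit <- fits                           # attach as a list-column

result$fit[[1]]                              # double brackets: first row's parameters
shapes <- sapply(result$fit, `[[`, "shape")  # pull shape out of every row at once
```

The same sapply pattern applied to the real fits (e.g. f$estimate[["shape"]]) turns the list-column back into plain numeric columns when you need them for the simulation.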
Upvotes: 0