Sarah Roberts

Reputation: 37

Splitting data and fitting distributions efficiently

For a project I have received a large amount of confidential patient-level data that I need to fit distributions to for use in a simulation model. I am using R.

The problem is that I need to fit distributions and obtain the shape/rate parameters for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how each variable is distributed), but I want to be able to set up a function or loop for each variable and generate the shape and rate parameters for each subset I define.

An example of this: I need to fit length-of-stay data for subsets of patients, and there are 48 subsets. The way I have currently been doing this is by manually filtering the data, extracting the values into vectors, and then fitting a distribution to each vector using fitdist.

i.e. For a variable that is gamma distributed:

library(dplyr)          # filter()/pull()
library(fitdistrplus)   # fitdist()

vector1 <- los_data %>%
  filter(group == 1, setting == 1, diagnosis == 1) %>%
  pull(los)   # "los" stands in here for the length-of-stay column

fitdist(vector1, "gamma")

I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I'm assuming something to do with a matrix, but I am absolutely clueless about how best to proceed.

Upvotes: 2

Views: 775

Answers (2)

see-king_of_knowledge

Reputation: 523

One common practice is to split the data using split and then apply the function of interest to each group. Let's assume here we have four columns: group, setting, diagnosis and stay.length. The first three each have two levels.

df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting  = sample(1:2, 64, TRUE),
  diagnosis  = sample(1:2, 64, TRUE), 
  stay.length = sample(1:5, 64, TRUE)
)
> head(df)
  group setting diagnosis stay.length
1     1       1         1           4
2     1       1         2           5
3     1       1         2           4
4     2       1         2           3
5     1       2         2           3
6     1       1         2           5

Perform the split and you get a list with one element per combination of the grouping variables:

dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))

> head(dfl)
$`1.1.1`
[1] 5 3 4 1 4 5 4 2 1

$`2.1.1`
[1] 5 4 5 4 3 1 5 3 1

$`1.2.1`
[1] 4 2 5 4 5 3 5 3

$`2.2.1`
[1] 2 1 4 3 5 4 4

$`1.1.2`
[1] 5 4 4 4 3 2 4 4 5 1 5 5

$`2.1.2`
[1] 5 4 4 5 3 2 4 5 1 2    

Afterwards, we can use lapply to apply whatever function we need to each group in the list. For example, we can take the mean:

dflm <- lapply(dfl, mean)
> dflm
$`1.1.1`
[1] 3.222222

.
.
.
.

$`2.2.2`
[1] 2.8

In your case, you can apply fitdist or any other function.

library(fitdistrplus)  # provides fitdist()
dfl.fitdist <- lapply(dfl, function(x) fitdist(x, "gamma"))

> dfl.fitdist
$`1.1.1`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape  3.38170  2.2831073
rate   1.04056  0.7573495

.
.
.


$`2.2.2`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape 4.868843  2.5184018
rate  1.549188  0.8441106
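
If you then want the shape and rate values gathered into one table rather than left inside a list of fit objects, one way (a small sketch building on the dfl.fitdist list above) is to pull the estimate vector out of each fit, since every fitdist object stores its parameter estimates in $estimate:

# One row per group/setting/diagnosis combination, columns shape and rate
params <- t(sapply(dfl.fitdist, function(f) f$estimate))
params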

Upvotes: 0

jrdnmdhl

Reputation: 1955

OK, your example isn't quite reproducible here, but I think the answer you want will be something like the following:

library(dplyr)
library(fitdistrplus)

result <- los_data %>%
  group_by(group, setting, diagnosis) %>%
  do({
    fit <- fitdist(.$my_column, "gamma")
    data_frame(group = .$group[1], setting = .$setting[1], diagnosis = .$diagnosis[1], fit = list(fit))
  }) %>%
  ungroup()

This will give you a data frame of all the fits, with columns for group, setting, and diagnosis, as well as a list-column containing the fit for each combination. Since it is a list-column, you will need to use double brackets to extract individual fits. Example:

# Get the fit in the first row
result$fit[[1]]
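
And if you would rather have the shape and rate estimates as ordinary numeric columns alongside the list-column, one option (a sketch on top of the result data frame above) is to extract them from each stored fit:

# Pull the parameter estimates out of each stored fitdist object
result$shape <- sapply(result$fit, function(f) f$estimate[["shape"]])
result$rate  <- sapply(result$fit, function(f) f$estimate[["rate"]])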

Upvotes: 0
