Sarah Roberts

Reputation: 37

Splitting data and fitting distributions efficiently

For a project I have received a large amount of confidential patient-level data that I need to fit distributions to for use in a simulation model. I am using R.

The problem is that I need to fit distributions and obtain the shape/rate parameters for at least 288 separate distributions (at least 48 subsets of 6 variables). The process will vary slightly between variables (depending on how each variable is distributed), but I want to be able to set up a function or loop for each variable and generate the shape and rate parameters for each subset I define.

An example of this: I need to fit length-of-stay data for subsets of patients, and there are 48 subsets. The way I have currently been doing this is by manually filtering the data, extracting the values into vectors, and then fitting a distribution to each vector using fitdist.

i.e. For a variable that is gamma distributed:

library(dplyr)          # filter()/pull()
library(fitdistrplus)   # fitdist()

vector1 <- los_data %>%
  filter(group == 1, setting == 1, diagnosis == 1) %>%
  pull(los)   # "los" stands in here for the length-of-stay column

fitdist(vector1, "gamma")

I am quite new to data science and data processing, and I know there must be a simpler way to do this than by hand! I'm assuming something to do with a matrix, but I am absolutely clueless about how best to proceed.

Upvotes: 2

Views: 775

Answers (2)

see-king_of_knowledge

Reputation: 523

One common practice is to split the data using split and then apply the function of interest to each group. Let's assume here we have four columns: group, setting, diagnosis and stay.length. The first three each have two levels.

df <- data.frame(
  group = sample(1:2, 64, TRUE),
  setting  = sample(1:2, 64, TRUE),
  diagnosis  = sample(1:2, 64, TRUE), 
  stay.length = sample(1:5, 64, TRUE)
)
> head(df)
  group setting diagnosis stay.length
1     1       1         1           4
2     1       1         2           5
3     1       1         2           4
4     2       1         2           3
5     1       2         2           3
6     1       1         2           5

Perform the split and you get a list with one element per combination of the grouping variables:

dfl <- split(df$stay.length, list(df$group, df$setting, df$diagnosis))

> head(dfl)
$`1.1.1`
[1] 5 3 4 1 4 5 4 2 1

$`2.1.1`
[1] 5 4 5 4 3 1 5 3 1

$`1.2.1`
[1] 4 2 5 4 5 3 5 3

$`2.2.1`
[1] 2 1 4 3 5 4 4

$`1.1.2`
[1] 5 4 4 4 3 2 4 4 5 1 5 5

$`2.1.2`
[1] 5 4 4 5 3 2 4 5 1 2    

Afterwards, we can use lapply to apply whatever function we need to each group in the list. For example, we can take the mean:

dflm <- lapply(dfl, mean)
> dflm
$`1.1.1`
[1] 3.222222

.
.
.
.

$`2.2.2`
[1] 2.8

In your case, you can apply fitdist or any other function.

library(fitdistrplus)  # provides fitdist()
dfl.fitdist <- lapply(dfl, function(x) fitdist(x, "gamma"))

> dfl.fitdist
$`1.1.1`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape  3.38170  2.2831073
rate   1.04056  0.7573495

.
.
.


$`2.2.2`
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
  estimate Std. Error
shape 4.868843  2.5184018
rate  1.549188  0.8441106
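
If you then want the shape and rate values gathered into one table rather than left inside a list of fit objects, one way (a small sketch building on the dfl.fitdist list above) is to pull the estimate vector out of each fit, since every fitdist object stores its parameter estimates in $estimate:

# One row per group/setting/diagnosis combination, columns shape and rate
params <- t(sapply(dfl.fitdist, function(f) f$estimate))
params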

Upvotes: 0

jrdnmdhl

Reputation: 1955

OK, your example isn't quite reproducible here, but I think the answer you want will be something like the following:

library(dplyr)
library(fitdistrplus)

result <- los_data %>%
  group_by(group, setting, diagnosis) %>%
  do({
    fit <- fitdist(.$my_column, "gamma")
    data_frame(group = .$group[1], setting = .$setting[1], diagnosis = .$diagnosis[1], fit = list(fit))
  }) %>%
  ungroup()

This will give you a data frame of all the fits, with columns for group, setting, and diagnosis, as well as a list-column containing the fit for each combination. Since it is a list-column, you will need to use double brackets to extract individual fits. Example:

# Get the fit in the first row
result$fit[[1]]
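
And if you would rather have the shape and rate estimates as ordinary numeric columns alongside the list-column, one option (a sketch on top of the result data frame above) is to extract them from each stored fit:

# Pull the parameter estimates out of each stored fitdist object
result$shape <- sapply(result$fit, function(f) f$estimate[["shape"]])
result$rate  <- sapply(result$fit, function(f) f$estimate[["rate"]])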

Upvotes: 0
