WellWellWell
WellWellWell

Reputation: 21

Constructing dirichlet multinomial from independant negative binomials

I have a dataset with ISO3 country code, Year (from 2010 to 2019), Causes of death category (24 categories in total) and number of deaths. The sum of deaths across causes is fixed. My goal is to predict the expected number of deaths for each country/cause combination for the period 2020-2023.

I am not allowed to share the data but here is a sample of data with only one single country and 4 causes of death categories :

# A tibble: 40 × 3
    Year Cause                      Deaths
   <dbl> <chr>                       <dbl>
 1  2010 Communicable diseaes       97203.
 2  2010 Non-communicable diseases   2743.
 3  2010 Injuries                    7128.
 4  2010 Other                      12712.
 5  2011 Communicable diseaes       96536.
 6  2011 Non-communicable diseases   2716.
 7  2011 Injuries                    7264.
 8  2011 Other                      13088.
 9  2012 Communicable diseaes       95786.
10  2012 Non-communicable diseases   2642.
11  2012 Injuries                    7140.
12  2012 Other                      13556.
13  2013 Communicable diseaes       94153.
14  2013 Non-communicable diseases   2652.
15  2013 Injuries                    7191.
16  2013 Other                      14058.
17  2014 Communicable diseaes       92192.
18  2014 Non-communicable diseases   2828.
19  2014 Injuries                    7288.
20  2014 Other                      14825.
21  2015 Communicable diseaes       96254.
22  2015 Non-communicable diseases   2617.
23  2015 Injuries                    8128.
24  2015 Other                      16339.
25  2016 Communicable diseaes      101976.
26  2016 Non-communicable diseases   2470.
27  2016 Injuries                    8366.
28  2016 Other                      17007.
29  2017 Communicable diseaes       97219 
30  2017 Non-communicable diseases   2365 
31  2017 Injuries                    7675 
32  2017 Other                      15686 
33  2018 Communicable diseaes       95826 
34  2018 Non-communicable diseases   2180 
35  2018 Injuries                    7937 
36  2018 Other                      15807 
37  2019 Communicable diseaes       97264 
38  2019 Non-communicable diseases   2080 
39  2019 Injuries                    8031 
40  2019 Other                      15660  

My idea was to run independent negative binomial regression models for each country/cause combination using gam function in R because there is clearly overdispersion in my data. Here is a example of code for the data sample (e.g. only one single country):

# original dataset
df <- df %>%
  mutate(expected_cause = NA,
         expected_cause_se = NA) %>%
  arrange(Year)

# create a dataset to stored predictions for 2020-2023
pred_pand <- expand.grid(Year = rep(c(2020, 2021,2022,2023), each = 1),
                               Cause = rep(unique(df$Cause))) %>%
  mutate(expected_cause = NA,
         expected_cause_se = NA) %>%
  arrange(Year)

for (i in unique(df$Cause)) {
  # create temp dataset
  whichs <- which(df$Cause == i)
  temp <- df[whichs, ]

  # run the model
  model <- gam(Deaths ~ s(Year), data = temp, family = nb(theta = NULL, link = "log"))
  pred <- predict(model, se.fit = TRUE, type = "response", newdata = 
  data.frame(Year = c(2020,2021,2022,2023)))

  # store predictions for 2020-2023
  whichs_pred <- which(pred_pand$Cause == i)
  pred_pand[whichs_pred, "expected_cause"] <- pred$fit
  pred_pand[whichs_pred, "expected_cause_se"] <- pred$se.fit
  
  # store predictions for years prior to 2019
  years <- unique(temp$Year)
  range_years <- min(years):max(years)
  pred_hist <- predict(model, se.fit = TRUE, type = "response",
                       newdata = data.frame(Year = range_years))
  df[whichs, "expected_cause"] <- pred_hist$fit
  df[whichs, "expected_cause_se"] <- pred_hist$se.fit

}

To make sure that the sum of deaths across causes add up to the known and fixed total, it seems I need to construct a Dirichlet multinomial from those independent negative binomials. See this reference: https://arxiv.org/pdf/2001.04343 "A Dirichlet-multinomial distribution is equivalent to a collection of independent negative binomial distributions with the same scale parameter conditioned on their sum". Is it the right way to go?

If so, I am not sure what I need to do in R to apply this. I tried different theta values in the code below but the sum of deaths across causes does not add up to the fixed Total + setting a defined theta value affects the model fit.

model <- gam(Deaths ~ s(Year), data = temp, family = nb(theta = **NULL**, link = "log"))

What should I do? If I manually rescale the expected number of deaths by causes so that the sum across causes matches the fixed total, it affects the model fit. If I don't rescale, the sum across causes does not match the fixed total.

Upvotes: 1

Views: 75

Answers (1)

Severin Pappadeux
Severin Pappadeux

Reputation: 20120

I think approach mentioned in the arxiv paper is just plain wrong. For sum of the negative binomial author state that sum M is distributed as negative binomial as well.

M ~ NB(a0, theta), where a0=Sumi ai

This is why it doesn't work for you, M is not a fixed value, and only by some coincidence its sampling it could be equal to value from your data.

Dirichlet-Multinomial condition and distribution shall be imposed right from the start, not trying to bolt it on later.

Not sure about cookie-cutter solution, but I would start with Dirichlet-Multinomial from Bioconductor https://bioconductor.org/packages/release/bioc/html/DirichletMultinomial.html

Looks like reasonable package, don't need to sample number of Dirichlet components, you already know that. Might require to dig inside to provide penalties etc

Upvotes: 1

Related Questions