NomNonYon
NomNonYon

Reputation: 87

Best way to calculate quartile ranges of population ages and store results as a data frame in R

I have a data frame with population data each year for a number of years. I want to get the median, Q1 and Q3 age for each year and store the results as a new data frame. What is the best way to approach this? Here is a sample of my data. Each year column contains the number of people of the corresponding age:

Age | 2000 | 2001 | 2002 
------------------------
2   | 4    | 1    | 2   
3   | 6    | 3    | 5  
4   | 10   | 9    | 8 
5   | 8    | 9    | 8
6   | 7    | 7    | 8

Edit: Just to add a little more detail. To get the median and quartile ranges I most likely need to get the cumulative frequencies for each year. I could create a data frame with all the CF's and join on the age column and based off that create a new data frame with the median and q range ages but I am sure there is a more efficient way of doing it.

Upvotes: 0

Views: 970

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21918

Ok based on your description I updated my code. First I incremented the age distribution by every year and then calculated the required statistics:

library(dplyr)
library(tidyr)

df <- tribble(
  ~ Age,  ~`2000`,  ~`2001`,  ~`2002`, 
  
  2,    4,     1,     2,   
  3,    6,     3,     5,  
  4,    10,    9,     8, 
  5,    8,     9,     8,
  6,    7,     7,     8
)



df %>%
  rowwise() %>%
  mutate(across(`2000`:`2002`, ~ list(rep(Age, .x)))) %>%
  pivot_longer(- Age, names_to = "years", values_to = "values") %>%
  unnest(values) %>% 
  select(-Age) %>%
  group_by(years) %>%
  summarise(medians = median(values),
            quantiles = list(quantile(values, probs = c(0.25, 0.75)))) %>%
  unnest_wider(quantiles)

# A tibble: 3 x 4
  years medians `25%` `75%`
  <chr>   <dbl> <dbl> <dbl>
1 2000        4     3   5  
2 2001        5     4   5  
3 2002        5     4   5.5

Please let me know if you need further explanations.

Upvotes: 1

bouncyball
bouncyball

Reputation: 10761

There's a really helpful function called uncount in the tidyr package that we can use. First, we use pivot_longer to move the year columns to be rows. Then, we use uncount so that each age shows up as many times as it occurs. Then, group_by year and calculate the summary statistics using summarise.

library(tidyverse)

dat %>%
    pivot_longer(-Age, 
                 names_to = "year", 
                 names_prefix = "X", 
                 values_to = "cnt") %>%
    uncount(cnt) %>%
    group_by(year) %>%
    summarise(q25 = quantile(Age, .25),
              q50 = median(Age),
              q75 = quantile(Age, .75))

#   year    q25   q50   q75
#   <chr> <dbl> <int> <dbl>
# 1 2000      3     4   5  
# 2 2001      4     5   5  
# 3 2002      4     5   5.5

And here's a base R solution using a similar idea with the rep function:

apply(dat[,-1], 2, 
      FUN = function(x){
          rep_age <- rep(dat$Age, x)
          c(quantile(rep_age, .25), 
            quantile(rep_age, .5), 
            quantile(rep_age, .75))
          })

#     X2000 X2001 X2002
# 25%     3     4   4.0
# 50%     4     5   5.0
# 75%     5     5   5.5

Data

dat <- structure(list(Age = 2:6, 
                      X2000 = c(4L, 6L, 10L, 8L, 7L), 
                      X2001 = c(1L, 3L, 9L, 9L, 7L), 
                      X2002 = c(2L, 5L, 8L, 8L, 8L)), 
                 class = "data.frame", 
                 row.names = c(NA, -5L))

Upvotes: 3

Related Questions