Reputation: 99

Possible in R Tidyverse to aggregate across rows, combining different age typologies together, only keeping complete age range sets?

I have a tricky challenge on a project that I'm not sure quite how to approach or even if it is possible at all.

I have a dataset where the values for two metrics (MetricA and MetricB here) are scraped from government reports from a range of different countries. The data is in a Tidy (long) format and each row is an observation for one metric for one area for one age group.

I am reporting the age ranges for each metric for those less than 15 years old (and for those who are 15 years old and older. The tricky thing is that many countries, but not all, use that typology.

Those countries where the reported values can't be mapped to this typology are discarded and those countries where data is incomplete for all ages is discarded. So, for example, something which had values for "< 15 years old", "15 to 49", "50+") would be kept, while a different country which had "< 18" and "18-49", but that's it, would be discarded, since it didn't have the "50+" people.

So, the challenge is this, I need to be able to pick out only those countries where the rows for that country cover the full gamut from 0 to whatever (I've been using 199 years old as a hypothetical max age in this case), with no overlap between age categories, with a break in those categories so that the data can be aggregated to form both a LT15 (Less than 15 years old) and a GTE15 (Greater than or equal to 15 years old) category. These are marked by minimum_age and maximum_age in my dataset. The proxy value for maximum_age when the value is to infinity (e.g. "18+" or ">59 years old" is 199.

Ideally this would use Tidyverse, but any package that could make this work would be fine.

A reprex for my dataset is here. In this case the Portuguese data would be discarded because there was no information for the 51 and greater age category.

country <- c("USA", "USA", "USA", "USA","Bolivia", "Bolivia", "Bolivia", "Bolivia", "Portugal", "Portugal")
metric <- c("MetricA", "MetricA","MetricB", "MetricB","MetricA", "MetricA","MetricA", "MetricA", "MetricB", "MetricB")
metric_value <- c(12, 10,8, 23, 4, 23, 14, 12, 1, 23)
minimum_age <- c(0, 15, 0, 15, 0, 3, 15, 30, 0, 19)
maximum_age <- c(14, 199, 14, 199, 2, 14, 29, 199, 18, 50)

df <- data.frame(country, metric, value, minimum_age, maximum_age)

The output from this would be something like:

  country  metric value age_category
1     USA MetricA    12         LT15
2     USA MetricA    10        GTE15
3     USA MetricB     8         LT15
4     USA MetricB    23        GTE15
5 Bolivia MetricA    27         LT15
6 Bolivia MetricA    26        GTE15

This doesn't seem like it's possible to me, but I'm sure that I'm missing something here. Is this a challenge that's even doable?

Thanks in advance!

EDIT: Added an expected output block EDIT2: Changed the reprex so that the numeric values were being entered as numeric and not as character. Also changed the reprex so that the "value" column was more clearly referring to the metric.

Upvotes: 1

Answers (3)

G. Grothendieck

Reputation: 269694

is.continuous takes a vector of start ages and a vector of end ages and ensures that all ages between the smallest and largest are covered with no overlaps and no gaps returning TRUE if those are all satisfied and FALSE otherwise. For example

is.continuous(c(0, 5), c(4, 8)) 
## [1] TRUE
is.continuous(c(0, 3), c(4, 8)) # overlap so FALSE
## [1] FALSE
is.continuous(c(0, 6), c(4, 8)) # gap so FALSE
## [1] FALSE

Now convert the columns that should be numeric to numeric, filter the groups using is.continuous and the other conditions and then categorize the rows, sum and sort.

Regarding the other conditions the question variously refers to the requirement that the maximum be 50+ or that it be equal to 199 and since it was not clear which is the actual condition we used 50+ but we can replace max(maximum_age) >= 50 with max(maximum_age) == 199 depending on what is wanted.

# return true if every age from minimum to maximum occurs once
is.continuous <- function(st, en) {
  s <- do.call("c", Map(seq, st, en))
  rng <- range(s)
  all(table(c(s, rng[1]:rng[2])) == 2)
}

df %>%
  mutate(value = as.numeric(value), minimum_age = as.numeric(minimum_age), 
   maximum_age = as.numeric(maximum_age)) %>%
  group_by(country, metric) %>%
  filter(is.continuous(minimum_age, maximum_age) && 
    14 %in% maximum_age && 15 %in% minimum_age &&
    max(maximum_age) >= 50) %>%
  group_by(age_category = if_else(maximum_age < 15,"LT15", "GTE15"), .add=TRUE) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  arrange(country, metric, desc(age_category))

giving:

# A tibble: 6 x 4
  country metric  age_category value
  <chr>   <chr>   <chr>        <dbl>
1 Bolivia MetricA LT15            27
2 Bolivia MetricA GTE15           26
3 USA     MetricA LT15            12
4 USA     MetricA GTE15           10
5 USA     MetricB LT15             8
6 USA     MetricB GTE15           23

Note

The input in the question was slightly changed but above we are using the original input which in reproducible form is:

df <- 
structure(list(country = c("USA", "USA", "USA", "USA", "Bolivia", 
"Bolivia", "Bolivia", "Bolivia", "Portugal", "Portugal"), metric = c("MetricA", 
"MetricA", "MetricB", "MetricB", "MetricA", "MetricA", "MetricA", 
"MetricA", "MetricB", "MetricB"), value = c("12", "10", "8", 
"23", "4", "23", "14", "12", "1", "23"), minimum_age = c("0", 
"15", "0", "15", "0", "3", "15", "30", "0", "19"), maximum_age = c("14", 
"199", "14", "199", "2", "14", "29", "199", "18", "50")), class = "data.frame", row.names = c(NA, 
-10L))

Upvotes: 1

CALUM Polwart

Reputation: 537

I'm not 100% certain the answer given works for a wider dataset. The example in the question included a 0-18age band and 18-50age band. But it could be 0-18 and 18+. 0-18 is still to be excluded I think...

I think this does, although it is likely more verbose than it needs to be

require(dplyr)
country <- c("USA", "USA", "USA", "USA","Bolivia", "Bolivia", "Bolivia", "Bolivia", "Portugal", "Portugal")
metric <- c("MetricA", "MetricA","MetricB", "MetricB","MetricA", "MetricA","MetricA", "MetricA", "MetricB", "MetricB")
value <- c("12", "10","8", "23", "4", "23", "14", "12", "1", "23")
minimum_age <- c("0", "15", "0", "15", "0", "3", "15", "30", "0", "19")
maximum_age <- c("14", "199", "14", "199", "2", "14", "29", "199", "18", "50")

df <- data.frame(country, metric, value, minimum_age, maximum_age)

keep_case <- function (LT15, LT_valid, GT_valid) {
    case_when(
    LT15 == T & LT_valid ==T ~ T,
    LT15 == F & GT_valid ==T ~ T,
)
}
df %>%
    # Make the numeric values numeric instead of characters
    mutate(
        maximum_age = as.numeric(maximum_age),
        minimum_age = as.numeric(minimum_age),
        value = as.numeric(value)
    ) %>%
    # track if the age range is less than 15 and / or greater than 15
    mutate(
        LT15 = if_else(maximum_age < 15, T,F),
        GTE15 = if_else(minimum_age >=15, T,F)
        ) %>%
    # group the results
    group_by(country, metric, LT15) %>%
    #summarise the results by groups
    summarise(
        value = sum(value),
        min_age = min(minimum_age),
        max_age = max(maximum_age)
    ) %>%
    # Check if Less that 15s cover whole age range
    # Check if over 15s cover whole age range
    mutate(LT_valid = if_else(min_age == 0 & max_age ==14, T, F),
           GT_valid = if_else(min_age ==15 & max_age ==199, T, F)) %>%
    # mark only valid cases to keep
    mutate(keep = keep_case(LT15, LT_valid, GT_valid)) %>%
    # filter off invalid cases
    filter(keep == T) %>%
    # remove the columns not required
    select(c(country, metric, value, age_category =LT15))%>%
    # rename the LT15 from T /F
    mutate(age_category = if_else(age_category==T, "LT15", "GTE15"))

Upvotes: 1

akrun

Reputation: 887231

Here is an option with dplyr. Convert the columns types (type.convert - as some of the numeric columns were created as character class - by quoting). Grouped by 'country' 'metric', filter out groups that doesn't have 'maximum_age' greater than 50, create a grouping column based on the values in 'minimum_age' and 'maximum_age' and summarise the 'value' column by taking the sum

library(dplyr)
df %>% 
   group_by(country, metric) %>%
   filter(min(minimum_age) >= 0, max(maximum_age) >= 199) %>%
   group_by(age_category = case_when(minimum_age >= 0 & 
                 maximum_age < 15 ~ "LT15",
         minimum_age >=15 ~ "GTE15"), .add = TRUE) %>% 
   summarise(value = sum(value, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 6 x 4
#  country metric  age_category value
#* <chr>   <chr>   <chr>        <int>
#1 Bolivia MetricA GTE15           26
#2 Bolivia MetricA LT15            27
#3 USA     MetricA GTE15           10
#4 USA     MetricA LT15            12
#5 USA     MetricB GTE15           23
#6 USA     MetricB LT15             8

data

df <- type.convert(df, as.is = TRUE)

Upvotes: 2

Possible in R Tidyverse to aggregate across rows, combining different age typologies together, only keeping complete age range sets?

Answers (3)

Note

data

Related Questions