Reputation: 99
I have a tricky challenge on a project that I'm not sure quite how to approach or even if it is possible at all.
I have a dataset where the values for two metrics (MetricA and MetricB here) are scraped from government reports from a range of different countries. The data is in a Tidy (long) format and each row is an observation for one metric for one area for one age group.
I am reporting the age ranges for each metric for those less than 15 years old (and for those who are 15 years old and older. The tricky thing is that many countries, but not all, use that typology.
Those countries where the reported values can't be mapped to this typology are discarded and those countries where data is incomplete for all ages is discarded. So, for example, something which had values for "< 15 years old", "15 to 49", "50+") would be kept, while a different country which had "< 18" and "18-49", but that's it, would be discarded, since it didn't have the "50+" people.
So, the challenge is this, I need to be able to pick out only those countries where the rows for that country cover the full gamut from 0 to whatever (I've been using 199 years old as a hypothetical max age in this case), with no overlap between age categories, with a break in those categories so that the data can be aggregated to form both a LT15 (Less than 15 years old) and a GTE15 (Greater than or equal to 15 years old) category. These are marked by minimum_age and maximum_age in my dataset. The proxy value for maximum_age when the value is to infinity (e.g. "18+" or ">59 years old" is 199.
Ideally this would use Tidyverse, but any package that could make this work would be fine.
A reprex for my dataset is here. In this case the Portuguese data would be discarded because there was no information for the 51 and greater age category.
country <- c("USA", "USA", "USA", "USA","Bolivia", "Bolivia", "Bolivia", "Bolivia", "Portugal", "Portugal")
metric <- c("MetricA", "MetricA","MetricB", "MetricB","MetricA", "MetricA","MetricA", "MetricA", "MetricB", "MetricB")
metric_value <- c(12, 10,8, 23, 4, 23, 14, 12, 1, 23)
minimum_age <- c(0, 15, 0, 15, 0, 3, 15, 30, 0, 19)
maximum_age <- c(14, 199, 14, 199, 2, 14, 29, 199, 18, 50)
df <- data.frame(country, metric, value, minimum_age, maximum_age)
The output from this would be something like:
country metric value age_category
1 USA MetricA 12 LT15
2 USA MetricA 10 GTE15
3 USA MetricB 8 LT15
4 USA MetricB 23 GTE15
5 Bolivia MetricA 27 LT15
6 Bolivia MetricA 26 GTE15
This doesn't seem like it's possible to me, but I'm sure that I'm missing something here. Is this a challenge that's even doable?
Thanks in advance!
EDIT: Added an expected output block EDIT2: Changed the reprex so that the numeric values were being entered as numeric and not as character. Also changed the reprex so that the "value" column was more clearly referring to the metric.
Upvotes: 1
Views: 694
Reputation: 269694
is.continuous
takes a vector of start ages and a vector of end ages and ensures that all ages between the smallest and largest are covered with no overlaps and no gaps returning TRUE if those are all satisfied and FALSE otherwise. For example
is.continuous(c(0, 5), c(4, 8))
## [1] TRUE
is.continuous(c(0, 3), c(4, 8)) # overlap so FALSE
## [1] FALSE
is.continuous(c(0, 6), c(4, 8)) # gap so FALSE
## [1] FALSE
Now convert the columns that should be numeric to numeric, filter the groups using is.continuous and the other conditions and then categorize the rows, sum and sort.
Regarding the other conditions the question variously refers to the requirement that the maximum be 50+ or that it be equal to 199 and since it was not clear which is the actual condition we used 50+ but we can replace max(maximum_age) >= 50
with max(maximum_age) == 199
depending on what is wanted.
# return true if every age from minimum to maximum occurs once
is.continuous <- function(st, en) {
s <- do.call("c", Map(seq, st, en))
rng <- range(s)
all(table(c(s, rng[1]:rng[2])) == 2)
}
df %>%
mutate(value = as.numeric(value), minimum_age = as.numeric(minimum_age),
maximum_age = as.numeric(maximum_age)) %>%
group_by(country, metric) %>%
filter(is.continuous(minimum_age, maximum_age) &&
14 %in% maximum_age && 15 %in% minimum_age &&
max(maximum_age) >= 50) %>%
group_by(age_category = if_else(maximum_age < 15,"LT15", "GTE15"), .add=TRUE) %>%
summarise(value = sum(value), .groups = "drop") %>%
arrange(country, metric, desc(age_category))
giving:
# A tibble: 6 x 4
country metric age_category value
<chr> <chr> <chr> <dbl>
1 Bolivia MetricA LT15 27
2 Bolivia MetricA GTE15 26
3 USA MetricA LT15 12
4 USA MetricA GTE15 10
5 USA MetricB LT15 8
6 USA MetricB GTE15 23
The input in the question was slightly changed but above we are using the original input which in reproducible form is:
df <-
structure(list(country = c("USA", "USA", "USA", "USA", "Bolivia",
"Bolivia", "Bolivia", "Bolivia", "Portugal", "Portugal"), metric = c("MetricA",
"MetricA", "MetricB", "MetricB", "MetricA", "MetricA", "MetricA",
"MetricA", "MetricB", "MetricB"), value = c("12", "10", "8",
"23", "4", "23", "14", "12", "1", "23"), minimum_age = c("0",
"15", "0", "15", "0", "3", "15", "30", "0", "19"), maximum_age = c("14",
"199", "14", "199", "2", "14", "29", "199", "18", "50")), class = "data.frame", row.names = c(NA,
-10L))
Upvotes: 1
Reputation: 537
I'm not 100% certain the answer given works for a wider dataset. The example in the question included a 0-18age band and 18-50age band. But it could be 0-18 and 18+. 0-18 is still to be excluded I think...
I think this does, although it is likely more verbose than it needs to be
require(dplyr)
country <- c("USA", "USA", "USA", "USA","Bolivia", "Bolivia", "Bolivia", "Bolivia", "Portugal", "Portugal")
metric <- c("MetricA", "MetricA","MetricB", "MetricB","MetricA", "MetricA","MetricA", "MetricA", "MetricB", "MetricB")
value <- c("12", "10","8", "23", "4", "23", "14", "12", "1", "23")
minimum_age <- c("0", "15", "0", "15", "0", "3", "15", "30", "0", "19")
maximum_age <- c("14", "199", "14", "199", "2", "14", "29", "199", "18", "50")
df <- data.frame(country, metric, value, minimum_age, maximum_age)
keep_case <- function (LT15, LT_valid, GT_valid) {
case_when(
LT15 == T & LT_valid ==T ~ T,
LT15 == F & GT_valid ==T ~ T,
)
}
df %>%
# Make the numeric values numeric instead of characters
mutate(
maximum_age = as.numeric(maximum_age),
minimum_age = as.numeric(minimum_age),
value = as.numeric(value)
) %>%
# track if the age range is less than 15 and / or greater than 15
mutate(
LT15 = if_else(maximum_age < 15, T,F),
GTE15 = if_else(minimum_age >=15, T,F)
) %>%
# group the results
group_by(country, metric, LT15) %>%
#summarise the results by groups
summarise(
value = sum(value),
min_age = min(minimum_age),
max_age = max(maximum_age)
) %>%
# Check if Less that 15s cover whole age range
# Check if over 15s cover whole age range
mutate(LT_valid = if_else(min_age == 0 & max_age ==14, T, F),
GT_valid = if_else(min_age ==15 & max_age ==199, T, F)) %>%
# mark only valid cases to keep
mutate(keep = keep_case(LT15, LT_valid, GT_valid)) %>%
# filter off invalid cases
filter(keep == T) %>%
# remove the columns not required
select(c(country, metric, value, age_category =LT15))%>%
# rename the LT15 from T /F
mutate(age_category = if_else(age_category==T, "LT15", "GTE15"))
Upvotes: 1
Reputation: 887231
Here is an option with dplyr
. Convert the columns types (type.convert
- as some of the numeric columns were created as character class - by quoting). Grouped by 'country' 'metric', filter
out groups that doesn't have 'maximum_age' greater than 50, create a grouping column based on the values in 'minimum_age' and 'maximum_age' and summarise
the 'value' column by taking the sum
library(dplyr)
df %>%
group_by(country, metric) %>%
filter(min(minimum_age) >= 0, max(maximum_age) >= 199) %>%
group_by(age_category = case_when(minimum_age >= 0 &
maximum_age < 15 ~ "LT15",
minimum_age >=15 ~ "GTE15"), .add = TRUE) %>%
summarise(value = sum(value, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 6 x 4
# country metric age_category value
#* <chr> <chr> <chr> <int>
#1 Bolivia MetricA GTE15 26
#2 Bolivia MetricA LT15 27
#3 USA MetricA GTE15 10
#4 USA MetricA LT15 12
#5 USA MetricB GTE15 23
#6 USA MetricB LT15 8
df <- type.convert(df, as.is = TRUE)
Upvotes: 2