Reputation: 16881
This seems like a simple problem, but I'm having trouble wrapping my mind around it. I have a data frame of locations with population by region of birth, and I'm trying to filter for the regions whose combined population exceeds a threshold—in this case, 50%.
For example, for each location I need to be able to say something like, "In Fairfield County, a majority of the foreign-born population were born in Central and South America or the Caribbean." To be able to phrase it that way, I need to include the first country that gets over the 50% mark.
An abridged version of my data, along with the first few rows for each location, is here:
library(tidyverse)
df <- structure(list(name = c("Fairfield County", "Fairfield County",
"Fairfield County", "Fairfield County", "Greater Hartford", "Greater Hartford",
"Greater Hartford", "Greater Hartford", "Greater Hartford"),
subregion = c("South America", "Central America", "Caribbean",
"South Central Asia", "Caribbean", "Eastern Europe", "South Central Asia",
"South America", "Southern Europe"),
pop = c(40565, 33919, 32044, 17031, 26939, 23765, 20153, 14384, 9309),
cum_share = c(0.2, 0.38, 0.54, 0.62, 0.2, 0.37, 0.51, 0.62, 0.69)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))
df %>%
group_by(name) %>%
top_n(4, pop)
#> # A tibble: 8 x 4
#> # Groups: name [2]
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Fairfield County Caribbean 32044 0.54
#> 4 Fairfield County South Central Asia 17031 0.62
#> 5 Greater Hartford Caribbean 26939 0.2
#> 6 Greater Hartford Eastern Europe 23765 0.37
#> 7 Greater Hartford South Central Asia 20153 0.51
#> 8 Greater Hartford South America 14384 0.62
My first plan was to filter for where the cumulative share was less than or equal to 51%, meaning the top-ranking regions until reaching a majority of the population. The problem with that is that because these shares aren't a continuous distribution, having a set cutoff point like this doesn't work, because I need to include the first region for which the cumulative share is a majority.
df %>%
filter(cum_share <= 0.51)
#> # A tibble: 5 x 4
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Greater Hartford Caribbean 26939 0.2
#> 4 Greater Hartford Eastern Europe 23765 0.37
#> 5 Greater Hartford South Central Asia 20153 0.51
As you can see by comparing to the first snapshot, Greater Hartford works as I'd expect. But Fairfield County should include the Caribbean, at which the cumulative share is 54%; by filtering with a set threshold of 51%, Caribbean isn't included. What I'd like to get is instead like this:
#> # A tibble: 6 x 4
#> name subregion pop cum_share
#> <chr> <chr> <dbl> <dbl>
#> 1 Fairfield County South America 40565 0.2
#> 2 Fairfield County Central America 33919 0.38
#> 3 Fairfield County Caribbean 32044 0.54
#> 4 Greater Hartford Caribbean 26939 0.2
#> 5 Greater Hartford Eastern Europe 23765 0.37
#> 6 Greater Hartford South Central Asia 20153 0.51
Here, the first place at which the share exceeds 50% is also included. I could filter manually, but I'm actually doing this by country, not region of the world, and for 18 locations, so it becomes unwieldy.
Thanks in advance!
Edit: Wow, I'm realizing my own foolishness—I could have calculated cumulative shares from populations in ascending order, not descending, and then easily filtered for where this threshold exceeds 50%. I'll leave this up, though, to help out someone who doesn't have control over their data in this way.
Upvotes: 2
Views: 218
Reputation: 66819
For example, for each location I need to be able to say something like, "In Fairfield County, a majority of the foreign-born population were born in Central and South America or the Caribbean."
For the general case of stopping after a condition is met, there's filter(lag(cumsum(cond), default=FALSE) == 0)
> df %>% group_by(name) %>% filter(cumsum(lag(cum_share > 0.5, default = FALSE)) == 0)
# A tibble: 6 x 4
# Groups: name [2]
name subregion pop cum_share
<chr> <chr> <dbl> <dbl>
1 Fairfield County South America 40565 0.20
2 Fairfield County Central America 33919 0.38
3 Fairfield County Caribbean 32044 0.54
4 Greater Hartford Caribbean 26939 0.20
5 Greater Hartford Eastern Europe 23765 0.37
6 Greater Hartford South Central Asia 20153 0.51
The OP identified a simpler filter in the case of a monotone condition (ie, one such that after first meeting the condition, later elements of the vector also do so): filter(lag(cum_share, default = 0) <= 0.5)
.
There's probably a good way to wrap this in a function (mutate .cond
from user input; mutate .keep
criterion = cumsum(lag(.cond, default=FALSE) == 0)
; filter; drop .cond
and .keep
), but I don't have the tidyverse NSE skills for the first step.
Upvotes: 4