Reputation: 682
I want to choose the top countries based on how much of the total production they cover. I want the countries I have filtered to cover say at least 50% of the total production.
I have this data:
production = structure(list(iso3_code = c("USA", "CHN", "BRA", "ARG", "UKR",
"IDN", "IND", "MEX", "ROU", "CAN", "IDN", "MYS", "THA", "NGA",
"COL", "ECU", "CMR", "GHA", "PNG", "HND", "USA", "BRA", "ARG",
"CHN", "IND", "PRY", "CAN", "UKR", "RUS", "BOL", "BRA", "IND",
"CHN", "THA", "PAK", "MEX", "COL", "GTM", "AUS", "USA"), country = c("USA",
"China", "Brazil", "Argentina", "Ukraine", "Indonesia", "India",
"Mexico", "Romania", "Canada", "Indonesia", "Malaysia", "Thailand",
"Nigeria", "Colombia", "Ecuador", "Cameroon", "Ghana", "Papua New Guinea",
"Honduras", "USA", "Brazil", "Argentina", "China", "India", "Paraguay",
"Canada", "Ukraine", "Russia", "Bolivia", "Brazil", "India",
"China", "Thailand", "Pakistan", "Mexico", "Colombia", "Guatemala",
"Australia", "USA"), item = c("Maize", "Maize", "Maize", "Maize",
"Maize", "Maize", "Maize", "Maize", "Maize", "Maize", "Oil palm fruit",
"Oil palm fruit", "Oil palm fruit", "Oil palm fruit", "Oil palm fruit",
"Oil palm fruit", "Oil palm fruit", "Oil palm fruit", "Oil palm fruit",
"Oil palm fruit", "Soybeans", "Soybeans", "Soybeans", "Soybeans",
"Soybeans", "Soybeans", "Soybeans", "Soybeans", "Soybeans", "Soybeans",
"Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane",
"Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane"
), value = c(392450840, 257173900, 82288298, 43462323, 35801050,
30253938, 27820000, 27169977, 18663939, 13884800, 115267491,
98419400, 15400000, 7850000, 5878504, 2785756, 2641118, 2604387,
2496325, 2493910, 123664230, 117887672, 37787927, 14189217, 13786000,
11045971, 7266600, 4460770, 4026850, 2942131, 746828157, 376900000,
108097100, 104360867, 67173975, 56841523, 36276860, 35568207,
33506830, 31335984)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -40L), spec = structure(list(cols = list(
iso3_code = structure(list(), class = c("collector_character",
"collector")), country = structure(list(), class = c("collector_character",
"collector")), item = structure(list(), class = c("collector_character",
"collector")), value = structure(list(), class = c("collector_double",
"collector")), share = structure(list(), class = c("collector_double",
"collector")), cumulative_share = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
What I have done is that I calculate within the group "item" the share for each country and the cumulative share. Then I filter within each group that has at most 0.5 in cumulative share.
production %>%
group_by(item) %>%
mutate(share = value / sum(value),
cumulative_share = cumsum(share)) %>%
filter(cumulative_share <= 0.5)
Which gives me:
# A tibble: 4 x 6
# Groups: item [4]
iso3_code country item value share cumulative_share
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 USA USA Maize 392450840 0.422 0.422
2 IDN Indonesia Oil palm fruit 115267491 0.451 0.451
3 USA USA Soybeans 123664230 0.367 0.367
4 BRA Brazil Sugar cane 746828157 0.468 0.468
But I want to within each group "item" select the top countries that cover at least 50% of the total of "value" within that group. So for maize I want to select USA and China (since they cover 0.422 + 0.277 = 0.699 > 0.5).
Any ideas on how to solve this?
Upvotes: 1
Views: 39
Reputation: 35554
cumsum()
a logical vector to find the first cumulative_share
greater than or equal to 0.5.
library(dplyr)
production %>%
group_by(item) %>%
mutate(share = value / sum(value),
cumulative_share = cumsum(share)) %>%
filter(cumsum(cumulative_share >= 0.5) <= 1) %>%
ungroup()
# # A tibble: 8 x 6
# iso3_code country item value share cumulative_share
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 USA USA Maize 392450840 0.422 0.422
# 2 CHN China Maize 257173900 0.277 0.699
# 3 IDN Indonesia Oil palm fruit 115267491 0.451 0.451
# 4 MYS Malaysia Oil palm fruit 98419400 0.385 0.835
# 5 USA USA Soybeans 123664230 0.367 0.367
# 6 BRA Brazil Soybeans 117887672 0.350 0.717
# 7 BRA Brazil Sugar cane 746828157 0.468 0.468
# 8 IND India Sugar cane 376900000 0.236 0.704
Upvotes: 1
Reputation: 388817
Do you need something like this?
library(dplyr)
production %>%
group_by(item) %>%
mutate(share = value / sum(value),
cumulative_share = cumsum(share)) %>%
ungroup %>%
slice(seq_len(which.max(cumulative_share >= 0.5)))
# A tibble: 2 x 6
# iso3_code country item value share cumulative_share
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 USA USA Maize 392450840 0.422 0.422
#2 CHN China Maize 257173900 0.277 0.699
This selects rows till cumulative_share
doesn't cross 0.5
.
Upvotes: 0
Reputation: 4658
Use lag
to select everything up to and including the first instance where your condition is false. Using default = TRUE
makes sure that everything is shifted and a TRUE
is prepended, such that the first observation is included as well.
library(tidyverse)
production %>%
group_by(item) %>%
arrange(desc(value)) %>%
mutate(share = value / sum(value),
cumulative_share = cumsum(share)) %>%
filter(lag(cumulative_share <= 0.5, default = TRUE)) %>%
ungroup()
# iso3_code country item value share cumulative_share
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 BRA Brazil Sugar cane 746828157 0.468 0.468
# 2 USA USA Maize 392450840 0.422 0.422
# 3 IND India Sugar cane 376900000 0.236 0.704
# 4 CHN China Maize 257173900 0.277 0.699
# 5 USA USA Soybeans 123664230 0.367 0.367
# 6 BRA Brazil Soybeans 117887672 0.350 0.717
# 7 IDN Indonesia Oil palm fruit 115267491 0.451 0.451
# 8 MYS Malaysia Oil palm fruit 98419400 0.385 0.835
Upvotes: 3