Reputation: 682

How to select the top values based on them covering a certain percentage?

I want to choose the top countries based on how much of the total production they cover. I want the countries I have filtered to cover say at least 50% of the total production.

I have this data:

production = structure(list(iso3_code = c("USA", "CHN", "BRA", "ARG", "UKR", 
"IDN", "IND", "MEX", "ROU", "CAN", "IDN", "MYS", "THA", "NGA", 
"COL", "ECU", "CMR", "GHA", "PNG", "HND", "USA", "BRA", "ARG", 
"CHN", "IND", "PRY", "CAN", "UKR", "RUS", "BOL", "BRA", "IND", 
"CHN", "THA", "PAK", "MEX", "COL", "GTM", "AUS", "USA"), country = c("USA", 
"China", "Brazil", "Argentina", "Ukraine", "Indonesia", "India", 
"Mexico", "Romania", "Canada", "Indonesia", "Malaysia", "Thailand", 
"Nigeria", "Colombia", "Ecuador", "Cameroon", "Ghana", "Papua New Guinea", 
"Honduras", "USA", "Brazil", "Argentina", "China", "India", "Paraguay", 
"Canada", "Ukraine", "Russia", "Bolivia", "Brazil", "India", 
"China", "Thailand", "Pakistan", "Mexico", "Colombia", "Guatemala", 
"Australia", "USA"), item = c("Maize", "Maize", "Maize", "Maize", 
"Maize", "Maize", "Maize", "Maize", "Maize", "Maize", "Oil palm fruit", 
"Oil palm fruit", "Oil palm fruit", "Oil palm fruit", "Oil palm fruit", 
"Oil palm fruit", "Oil palm fruit", "Oil palm fruit", "Oil palm fruit", 
"Oil palm fruit", "Soybeans", "Soybeans", "Soybeans", "Soybeans", 
"Soybeans", "Soybeans", "Soybeans", "Soybeans", "Soybeans", "Soybeans", 
"Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane", 
"Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane", "Sugar cane"
), value = c(392450840, 257173900, 82288298, 43462323, 35801050, 
30253938, 27820000, 27169977, 18663939, 13884800, 115267491, 
98419400, 15400000, 7850000, 5878504, 2785756, 2641118, 2604387, 
2496325, 2493910, 123664230, 117887672, 37787927, 14189217, 13786000, 
11045971, 7266600, 4460770, 4026850, 2942131, 746828157, 376900000, 
108097100, 104360867, 67173975, 56841523, 36276860, 35568207, 
33506830, 31335984)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -40L), spec = structure(list(cols = list(
    iso3_code = structure(list(), class = c("collector_character", 
    "collector")), country = structure(list(), class = c("collector_character", 
    "collector")), item = structure(list(), class = c("collector_character", 
    "collector")), value = structure(list(), class = c("collector_double", 
    "collector")), share = structure(list(), class = c("collector_double", 
    "collector")), cumulative_share = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

What I have done is that I calculate within the group "item" the share for each country and the cumulative share. Then I filter within each group that has at most 0.5 in cumulative share.

production %>% 
  group_by(item) %>% 
  mutate(share = value / sum(value),
         cumulative_share = cumsum(share)) %>% 
  filter(cumulative_share <= 0.5)

Which gives me:

# A tibble: 4 x 6
# Groups:   item [4]
  iso3_code country   item               value share cumulative_share
  <chr>     <chr>     <chr>              <dbl> <dbl>            <dbl>
1 USA       USA       Maize          392450840 0.422            0.422
2 IDN       Indonesia Oil palm fruit 115267491 0.451            0.451
3 USA       USA       Soybeans       123664230 0.367            0.367
4 BRA       Brazil    Sugar cane     746828157 0.468            0.468

But I want to within each group "item" select the top countries that cover at least 50% of the total of "value" within that group. So for maize I want to select USA and China (since they cover 0.422 + 0.277 = 0.699 > 0.5).

Any ideas on how to solve this?

Upvotes: 1

Answers (3)

Darren Tsai

Reputation: 35554

cumsum() a logical vector to find the first cumulative_share greater than or equal to 0.5.

library(dplyr)

production %>% 
  group_by(item) %>% 
  mutate(share = value / sum(value),
         cumulative_share = cumsum(share)) %>%
  filter(cumsum(cumulative_share >= 0.5) <= 1) %>%
  ungroup()

# # A tibble: 8 x 6
#   iso3_code country   item               value share cumulative_share
#   <chr>     <chr>     <chr>              <dbl> <dbl>            <dbl>
# 1 USA       USA       Maize          392450840 0.422            0.422
# 2 CHN       China     Maize          257173900 0.277            0.699
# 3 IDN       Indonesia Oil palm fruit 115267491 0.451            0.451
# 4 MYS       Malaysia  Oil palm fruit  98419400 0.385            0.835
# 5 USA       USA       Soybeans       123664230 0.367            0.367
# 6 BRA       Brazil    Soybeans       117887672 0.350            0.717
# 7 BRA       Brazil    Sugar cane     746828157 0.468            0.468
# 8 IND       India     Sugar cane     376900000 0.236            0.704

Upvotes: 1

Ronak Shah

Reputation: 388817

Do you need something like this?

library(dplyr)

production %>% 
   group_by(item) %>% 
   mutate(share = value / sum(value),
          cumulative_share = cumsum(share)) %>%
   ungroup %>%
   slice(seq_len(which.max(cumulative_share >= 0.5)))

# A tibble: 2 x 6
#  iso3_code country item      value share cumulative_share
#  <chr>     <chr>   <chr>     <dbl> <dbl>            <dbl>
#1 USA       USA     Maize 392450840 0.422            0.422
#2 CHN       China   Maize 257173900 0.277            0.699

This selects rows till cumulative_share doesn't cross 0.5.

Upvotes: 0

Bas

Reputation: 4658

Use lag to select everything up to and including the first instance where your condition is false. Using default = TRUE makes sure that everything is shifted and a TRUE is prepended, such that the first observation is included as well.

library(tidyverse)

production %>% 
  group_by(item) %>% 
  arrange(desc(value)) %>% 
  mutate(share = value / sum(value),
         cumulative_share = cumsum(share)) %>% 
  filter(lag(cumulative_share <= 0.5, default = TRUE)) %>% 
  ungroup()

#   iso3_code country   item               value share cumulative_share
#   <chr>     <chr>     <chr>              <dbl> <dbl>            <dbl>
# 1 BRA       Brazil    Sugar cane     746828157 0.468            0.468
# 2 USA       USA       Maize          392450840 0.422            0.422
# 3 IND       India     Sugar cane     376900000 0.236            0.704
# 4 CHN       China     Maize          257173900 0.277            0.699
# 5 USA       USA       Soybeans       123664230 0.367            0.367
# 6 BRA       Brazil    Soybeans       117887672 0.350            0.717
# 7 IDN       Indonesia Oil palm fruit 115267491 0.451            0.451
# 8 MYS       Malaysia  Oil palm fruit  98419400 0.385            0.835

Upvotes: 3

How to select the top values based on them covering a certain percentage?

Answers (3)

Related Questions