Domnick
Domnick

Reputation: 519

Make a range of ID's based on sum of values in R

Being not so well versed with r, don't know if this is a simple problem or not. I want to create a range of ID's based on their sum of values which make up 60%(or approx) of total sum. here is the dataframe. DF

ID     Val
98     2
98     1
98     4
3     11
3      6
3      8
3      1
24     3
24     2
46     1
46     2
59     6

Such that I would first sort the DF by ID and then check for which range of ID's the Value sums upto 60% and group them and for rest, group them by 10%,10%,10%,10%(or it could be random 10%,10%,20% or 5%,15%,10%,10%). such that my dataframe would look like

ID     Val
3-24   35           # (11+6+8+1+3+2) ~ 62% of the total sum of `Val` column
46-59  9            # (1+2+6) = 18% of the total sum of `Val` column
98     7            # (2+1+4) =14% of the total sum of `Val` column

I could try this

DF=DF[with(DF, order(DF$ID)), ]
perce = round(sum(DF$ID)*60/100)
for(i in 1:dim(DF)[1]){
     if(sum(DF$Val) == perce){
      ID=which(DF$ID)
       .
       .
       .
put those ID's in a range that constitutes 60%

       }
    }

I don't know if this could be possible or not.?

Thanks Domnick

Upvotes: 1

Views: 194

Answers (1)

Paul
Paul

Reputation: 9087

First, we sort the data and get the sum of each ID-group.

Then we can use cumsum(Val) to get the running total. We need to lag this so it represents "the sum of all ID-group's values before this row".

Now, we can use cut to assign the cumulative sum to the interval groups (-∞, 0.6 * total], (0.7 * total, 0.8 * total] and (0.8 * total, ∞).

Then we can group_by this interval and get the sum of Val.

library('tidyverse')

df <- tribble(
  ~ID, ~Val,
   98,    2,
   98,    1,
   98,    4,
    3,    11,
    3,    6,
    3,    8,
    3,    1,
   24,    3,
   24,    2,
   46,    1,
   46,    2,
   59,    6
)

breaks_proportions <- c(0.6, 0.1, 0.1)
breaks_values <- cumsum(breaks_proportions) * sum(df$Val)

df %>%
  arrange(ID) %>%
  group_by(ID) %>%
  summarise(Val = sum(Val)) %>%
  mutate(
    running_total = lag(cumsum(Val), default = 0),
    group = cut(
      running_total,
      c(-Inf, breaks_values, Inf))) %>%
  group_by(group) %>%
  summarise(
    ID = stringr::str_c(min(ID), '-', max(ID)),
    Val = sum(Val)) %>%
  select(ID, Val)
# # A tibble: 4 x 2
#      ID   Val
#   <chr> <dbl>
# 1  3-24    31
# 2 46-46     3
# 3 59-59     6
# 4 98-98     7

Upvotes: 2

Related Questions