Reputation: 519
Being not so well versed with r, don't know if this is a simple problem or not.
I want to create a range of ID's based on their sum of values which make up 60%(or approx) of total sum. here is the dataframe. DF
ID Val
98 2
98 1
98 4
3 11
3 6
3 8
3 1
24 3
24 2
46 1
46 2
59 6
Such that I would first sort the DF
by ID and then check for which range of ID's the Value sums upto 60% and group them and for rest, group them by 10%,10%,10%,10%(or it could be random 10%,10%,20% or 5%,15%,10%,10%). such that my dataframe would look like
ID Val
3-24 35 # (11+6+8+1+3+2) ~ 62% of the total sum of `Val` column
46-59 9 # (1+2+6) = 18% of the total sum of `Val` column
98 7 # (2+1+4) =14% of the total sum of `Val` column
I could try this
DF=DF[with(DF, order(DF$ID)), ]
perce = round(sum(DF$ID)*60/100)
for(i in 1:dim(DF)[1]){
if(sum(DF$Val) == perce){
ID=which(DF$ID)
.
.
.
put those ID's in a range that constitutes 60%
}
}
I don't know if this could be possible or not.?
Thanks Domnick
Upvotes: 1
Views: 194
Reputation: 9087
First, we sort the data and get the sum
of each ID
-group.
Then we can use cumsum(Val)
to get the running total. We need to lag
this so it represents "the sum of all ID
-group's values before this row".
Now, we can use cut
to assign the cumulative sum to the interval groups (-∞, 0.6 * total]
, (0.7 * total, 0.8 * total]
and (0.8 * total, ∞)
.
Then we can group_by
this interval and get the sum
of Val
.
library('tidyverse')
df <- tribble(
~ID, ~Val,
98, 2,
98, 1,
98, 4,
3, 11,
3, 6,
3, 8,
3, 1,
24, 3,
24, 2,
46, 1,
46, 2,
59, 6
)
breaks_proportions <- c(0.6, 0.1, 0.1)
breaks_values <- cumsum(breaks_proportions) * sum(df$Val)
df %>%
arrange(ID) %>%
group_by(ID) %>%
summarise(Val = sum(Val)) %>%
mutate(
running_total = lag(cumsum(Val), default = 0),
group = cut(
running_total,
c(-Inf, breaks_values, Inf))) %>%
group_by(group) %>%
summarise(
ID = stringr::str_c(min(ID), '-', max(ID)),
Val = sum(Val)) %>%
select(ID, Val)
# # A tibble: 4 x 2
# ID Val
# <chr> <dbl>
# 1 3-24 31
# 2 46-46 3
# 3 59-59 6
# 4 98-98 7
Upvotes: 2