rsd_unleashed
rsd_unleashed

Reputation: 161

Group items and find the top n% from each group

I have data in the below format. I've to find the sum of values of the top 25% in each group - grouped by ID.

If the number corresponding to the top 25% in a group is a fraction, give the data for the ceil(higher number) of that fraction

ID  Value
3   10
3   83
1   13
2   62
2   26
3   63
3   37
1   17
3   22
3   41
3   45
2   65
3   63
1   45
2   71
3   61
2   97
2   89
3   72
3   16
1   22

While I was able to get the top n rows from using the below code, I've not figured a way to get the top n% in each group

p2<-group_by(data, ID) %>%
mutate(rank = rank(desc(Value))) %>%
arrange(rank) %>%
filter(rank <= 3)

The expected result looks something like this. There are 4 rows in the data with ID 1 - top 25% means 1 row.

There are 6 rows with ID 2 - ceil(0.25*6) is 2 - so top 2 rows. Similarly, for ID 3, there are 11 rows - ceil(0.25*11) is top 3 rows under ID 3.

Please let me know how to get this!

ID  Value
1   45
2   97
2   89
3   83
3   72
3   65

Upvotes: 1

Views: 326

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388807

We can arrange the dataframe in descending order of Value and then slice the number of rows based on the ratio for each group.

library(dplyr)

df %>%
  arrange(desc(Value)) %>%
  group_by(ID) %>%
  slice(1:ceiling(0.25 * n()))

#     ID Value
#  <int> <int>
#1     1    45
#2     2    97
#3     2    89
#4     3    83
#5     3    72
#6     3    63

We can also do this in base R. First order the rows based on Value, split the rows based on ID and then select top n rows from each group.

new_df <- df[order(df$Value, decreasing = TRUE), ]
new_df[unlist(Map(head, split(1:nrow(new_df), new_df$ID), 
                  ceiling(table(new_df$ID) * 0.25))), ]

#   ID Value
#14  1    45
#17  2    97
#18  2    89
#2   3    83
#19  3    72
#6   3    63

Upvotes: 7

Related Questions