iskandarblue
iskandarblue

Reputation: 7526

Aggregating by subsets in dplyr

I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.

A random sample of the data that I am working with looks like this:

> df
       auto_id user_id month
164537    7124  240249    10
151635    7358  226423     9
117288    7376  172463     9
177119    6085  199194    11
128904    7110  141608     9
157194    7143  241964     9
71303     6090  141646     7
72480     6808  175910     7
108705    6602  213098     8
97889     7379  185516     8
184906    6405  212580    12
37242     6057  197905     8
157284    6548  162928     9
17910     6885  194180    10
70660     7162  161827     7
8593      7375  207061     8
28712     6311  176373    10
144194    7324  142715     9
73106     7196  176153     7
67065     7392  171039     7
77954     7116  161489     7
59842     7107  162637     7
101819    5994  182973     9
183546    6427  142029    12
102881    6477  188129     8

In every month, there many users who are the same, and first we should subset by month and make a frequency table of the users and the amount of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset, this is not the case):

full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
    Var1 Freq
1 100231   10
2 100744   17
3 111281    1
4 111814    2
5 113716    3
6 117493    3

As we can see, in the full data set, in month of July (month = 7), users have taken multiple trips. Now the important part - which is to subset only the top 10% of these users (the top 10% in terms of Freq)

tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),] 
topten <- head(users, n = tenPercent)

Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users

sum(topten$Freq)
[1] 12147

In the end the output should look like this

> output
  month trips
1     7   12147
2     8   ...
3     9   ...
4     10   ...
5     11   ...
6     12   ...

Is there a way to automate this process using dplyr - I mean specifically the subsetting by the top ten percent ? I have tried

output <- full_data %>% 
+     group_by(month) %>% 
+     summarise(n = n())

But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr ? :

tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),] 
topten <- head(users, n = tenPercent)

Upvotes: 0

Views: 87

Answers (1)

eipi10
eipi10

Reputation: 93871

The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.

library(dplyr)

full_data %>% group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9) %>%
  summarise(n_trips = sum(n))

UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).

set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30), 
                       month=sample(1:12, 10000, replace=TRUE))

Let's look as the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id the top three users represent the top 10% of users:

full_data %>% filter(month==1) %>%
  group_by(month, user_id) %>%
  tally %>%
  arrange(desc(n)) %>% as.data.frame 
   month user_id  n
1      1      28 60
2      1      29 57
3      1      26 54
4      1      30 53
5      1      27 49
6      1      22 43
7      1      21 41
8      1      20 40
9      1      23 40
10     1      24 38
11     1      25 38
12     1      19 37
13     1      18 33
14     1      16 28
15     1      15 27
16     1      17 27
17     1      14 26
18     1       9 20
19     1      12 20
20     1      13 20
21     1      10 17
22     1      11 17
23     1       6 15
24     1       7 13
25     1       8 13
26     1       4  9
27     1       5  7
28     1       2  3
29     1       3  2
30     1       1  1

So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is on of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:

full_data %>% filter(month==1) %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9)
  month user_id     n
1     1      26    54
2     1      28    60
3     1      29    57

And the sum of n (the total number of trips for the top 10%) is:

full_data %>% filter(month==1) %>%
  group_by(month, user_id) %>%
  tally %>%
  group_by(month) %>%
  filter(percent_rank(n) >= 0.9) %>%
  summarise(n_trips = sum(n))
  month n_trips
1     1     171

So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.

Upvotes: 1

Related Questions