Andres

Reputation: 291

Grouping by Column - Order Issue

I have an issue with summarising after grouping by column. I have a data set that looks like this:

    ID       KEY
  320249T    75
  320249T    59
  320249T    83
  320249T    78
  320249T    42
  320249T    78
  320249T    42

(the data has around 23 other columns and around 2 million rows). The data is already ordered by date, i.e., the first row for a particular ID is the latest and the last row is the earliest. I want to keep the KEY that is repeated most often, so I do (df is my data set):

df.freq1 <- df %>% group_by(ID, KEY) %>% summarise(count=n())

which gives me as result:

      ID    KEY count
1:  320249T  42   2
2:  320249T  59   1
3:  320249T  75   1
4:  320249T  78   2
5:  320249T  83   1

In cases in which two (or more) keys have the same count, I want to keep the most recent key, which in this particular case would be 78 (see the first table). However, when I do:

df.freq1[,.SD[which.max(count),],by=ID]

I get

      ID    KEY count
1:  320249T  42     2

I guess it chooses 42 because of the order of the grouped result, so how do I make the grouping preserve the time order of the original data frame?
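The guess is right: `which.max` in R, like `argmax` in most languages, returns the position of the *first* maximum, so among tied counts the winner is whichever key happens to sort first in the grouped result (42 here, since `group_by`/`summarise` sorts by KEY). A minimal Python illustration of that behaviour (not the original R, and the dict below just mirrors the counts from the example):

```python
# Grouped result from the question, keyed by KEY, sorted ascending
# the way group_by/summarise leaves it.
counts = {42: 2, 59: 1, 75: 1, 78: 2}

keys = sorted(counts)                    # [42, 59, 75, 78]
values = [counts[k] for k in keys]       # [2, 1, 1, 2]

# index() returns the FIRST position of the maximum, so among the
# tied counts (42 and 78) the earlier-sorted key wins.
first_max = values.index(max(values))
print(keys[first_max])                   # 42, not 78
```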

Upvotes: 1

Views: 55

Answers (1)

Axeman

Reputation: 35297

Not the most elegant, but this dplyr solution works, and avoids a slow arrange:

df %>%   
  mutate(order = row_number()) %>% 
  group_by(ID, KEY) %>% 
  summarise(n = n(),
            order = first(order)) %>% 
  filter(n == max(n)) %>% 
  slice(which.min(order))
Source: local data frame [1 x 4]
Groups: ID [1]

       ID   KEY     n order
    <chr> <int> <int> <int>
1 320249T    78     2     4
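The same tie-breaking idea — tag each row with its original position before grouping, then prefer the smallest position among tied counts — can be sketched in pandas for comparison (a hypothetical translation, not the answerer's code; column names mirror the example, and rows are assumed newest-first as in the question):

```python
import pandas as pd

# Example data from the question: rows are ordered newest-first.
df = pd.DataFrame({
    "ID":  ["320249T"] * 7,
    "KEY": [75, 59, 83, 78, 42, 78, 42],
})

# Record the original (time) order before grouping, like mutate(order = row_number()).
df["order"] = range(len(df))

# Count each (ID, KEY) pair and remember the smallest row number at
# which the pair appears, i.e. its most recent occurrence.
counts = (df.groupby(["ID", "KEY"], as_index=False)
            .agg(n=("order", "size"), order=("order", "min")))

# Within each ID, keep the most frequent KEY; break ties by the
# smallest original row number (the most recent occurrence).
result = (counts.sort_values(["n", "order"], ascending=[False, True])
                .groupby("ID", as_index=False)
                .head(1))

print(result)   # KEY 78 wins the tie, matching the dplyr answer
```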

Upvotes: 2
