Reputation: 291
I have an issue with summarising after grouping by column. I have a data set that looks like this:
ID KEY
320249T 75
320249T 59
320249T 83
320249T 78
320249T 42
320249T 78
320249T 42
(the data has around 23 other columns and around 2 million rows). The data is already ordered by date, i.e., the first row for a particular ID is the latest and the last row is the earliest. I want to keep the KEY that is repeated most often, so I do (df is my data set):
df.freq1 <- df %>% group_by(ID, KEY) %>% summarise(count=n())
which gives me this result:
ID KEY count
1: 320249T 42 2
2: 320249T 59 1
3: 320249T 75 1
4: 320249T 78 2
5: 320249T 83 1
In cases where two (or more) keys have the same count, I want to keep the key that occurs latest in time, which in this particular case would be 78 (see the first table). However, when I do:
df.freq1[, .SD[which.max(count), ], by = ID]
I get
ID KEY count
1: 320249T 42 2
I guess it chooses 42 because of the order of the grouped result, so how do I make the result of the grouping preserve the time order of the original data frame?
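To confirm my guess: which.max() returns the index of the first maximum it encounters, so with the grouped result sorted by KEY, the first tied count (42) wins:

which.max(c(2, 1, 1, 2, 1))  # counts in KEY order: 42, 59, 75, 78, 83
# [1] 1                      # index 1, i.e. key 42, not 78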
Upvotes: 1
Views: 55
Reputation: 35297
Not the most elegant, but this dplyr solution works and avoids a slow arrange. Because your rows are ordered newest-first, first(order) records each key's most recent occurrence, and which.min(order) then picks, among the tied counts, the key seen most recently:
df %>%
  mutate(order = row_number()) %>%      # record the original (time) order
  group_by(ID, KEY) %>%
  summarise(n = n(),                    # frequency of each key
            order = first(order)) %>%   # earliest row index = most recent occurrence
  filter(n == max(n)) %>%               # keep the most frequent key(s) per ID
  slice(which.min(order))               # break ties with the most recent key
Source: local data frame [1 x 4]
Groups: ID [1]

       ID   KEY     n order
    <chr> <int> <int> <int>
1 320249T    78     2     4
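If you'd rather stay in data.table (your which.max attempt suggests df already is one), here is a minimal sketch of the same idea; the row-index column rn is a name I'm introducing for illustration:

library(data.table)
setDT(df)                                # convert in place; no-op if already a data.table
df[, rn := .I]                           # .I is the row number; rows are newest-first
freq <- df[, .(n = .N, rn = rn[1]), by = .(ID, KEY)]  # rn[1] = most recent occurrence
freq[order(ID, -n, rn)][, .SD[1], by = ID]            # most frequent key, ties broken by recency

For your sample data this should likewise return key 78 with count 2.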
Upvotes: 2