Reputation: 1

Aggregating and ranking of groups in R

I have a dataset in this format in R:

+----------+-------+-----------+
|  Person  | Group | Timestamp |
+----------+-------+-----------+
| Person A | X     | 12:00 PM  |
| Person A | X     | 12:01 PM  |
| Person A | X     | 12:03 PM  |
| Person A | Y     | 12:10 PM  |
| Person A | Y     | 12:11 PM  |
| Person A | Y     | 12:12 PM  |
| Person A | X     | 12:20 PM  |
| Person A | X     | 12:21 PM  |
| Person A | X     | 12:22 PM  |
| …        |       |           |
+----------+-------+-----------+

I need to convert this into this format:

+----------+-------+---------+
|  Person  | Group | Ranking |
+----------+-------+---------+
| Person A | X     |       1 |
| Person A | Y     |       2 |
| Person A | X     |       3 |
| …        |       |         |
+----------+-------+---------+

(group all the similar entries into 1 - the same group can repeat after another group like in the example above - the groups are X > Y > X)

I have hundreds of Persons and ~20 million records. I tried running a for loop but that just takes way too much time.

Please let me know if there is an easier way to achieve this.

Any help is appreciated. Thanks in advance.

Upvotes: 0

Answers (3)

Gregor Thomas

Reputation: 145765

Here's a data.table solution, should be quite fast.

library(data.table)

dt[, .(Ranking = rleid(Group), Group), by = .(Person)][, .SD[1], by = .(Ranking, Person)]
#      Person Ranking Group
# 1: Person A       1     X
# 2: Person A       2     Y
# 3: Person A       3     X

(Original method didn't calculate the rleid for each person separately, edited to fix.)

A different method. Not sure if this will be any faster, but we could conceptualize the problem as keeping rows where either the Person or the Group is different from the previous row, then numbering them by group:

dt[is.na(shift(Person)) | shift(Person) != Person | shift(Group) != Group, .(Person, Group)][, Ranking := 1:.N, by = .(Person)][]
#      Person Group Ranking
# 1: Person A     X       1
# 2: Person A     Y       2
# 3: Person A     X       3

Using this data:

dt = fread("  Person  | Group | Timestamp
 Person A | X     | 12:00 PM  
 Person A | X     | 12:01 PM  
 Person A | X     | 12:03 PM  
 Person A | Y     | 12:10 PM  
 Person A | Y     | 12:11 PM  
 Person A | Y     | 12:12 PM  
 Person A | X     | 12:20 PM  
 Person A | X     | 12:21 PM  
 Person A | X     | 12:22 PM", sep = "|")

Upvotes: 3

d.b

Reputation: 32548

library(dplyr)
library(tidyr)
d %>%
    group_by(Person) %>%
    mutate(Ranking = sequence(rle(Group)$lengths) == 1) %>%
    ungroup() %>%
    select(-Timestamp) %>%
    filter(Ranking) %>%
    mutate(Ranking = cumsum(Ranking))
## A tibble: 3 x 3
#  Person   Group Ranking
#  <chr>    <chr>   <int>
#1 Person A X           1
#2 Person A Y           2
#3 Person A X           3

In Base R

do.call(rbind, lapply(split(d, d$Person), function(x){
    data.frame(Person = x$Person[1],
               with(rle(x$Group),
                    data.frame(Group = values,
                               Ranking = seq_along(values))))}))

DATA

d = structure(list(Person = c("Person A", "Person A", "Person A", 
                              "Person A", "Person A", "Person A",
                              "Person A", "Person A", "Person A"),
                   Group = c("X", "X", "X", "Y", "Y", "Y", "X", "X", "X"),
                   Timestamp = c("12:00 PM", "12:01 PM", "12:03 PM", "12:10 PM",
                                 "12:11 PM", "12:12 PM", "12:20 PM", "12:21 PM",
                                 "12:22 PM")),
              class = "data.frame",
              row.names = c(NA, -9L))

Upvotes: 1

qdread

Reputation: 3943

Here is a tidyverse solution that ensures the timestamps are sorted in ascending order within Person before returning the rankings.

library(tidyverse)

get_ranking <- function(data) {
  grps <- rle(data$Group)$values
  data.frame(Group = grps, Ranking = seq_along(grps))
}

dat %>%
  group_by(Person) %>%
  arrange(Timestamp) %>%
  group_modify(~ get_ranking(.x))

Using this data:

dat <- data.frame(Person= 'Person A', 
                  Group=rep(c('X','Y','X'),each=3), 
                  Timestamp=as.POSIXct('2010-01-01 12:00 PM')+(1:9)*60,
                  stringsAsFactors = FALSE)

To produce this output:

# A tibble: 3 x 3
# Groups:   Person [1]
  Person   Group Ranking
  <chr>    <fct>   <int>
1 Person A X           1
2 Person A Y           2
3 Person A X           3

Upvotes: 1

Aggregating and ranking of groups in R

Answers (3)

Related Questions