Reputation: 1
I have a dataset in this format in R:
+----------+-------+-----------+
| Person | Group | Timestamp |
+----------+-------+-----------+
| Person A | X | 12:00 PM |
| Person A | X | 12:01 PM |
| Person A | X | 12:03 PM |
| Person A | Y | 12:10 PM |
| Person A | Y | 12:11 PM |
| Person A | Y | 12:12 PM |
| Person A | X | 12:20 PM |
| Person A | X | 12:21 PM |
| Person A | X | 12:22 PM |
| … | | |
+----------+-------+-----------+
I need to convert this into this format:
+----------+-------+---------+
| Person | Group | Ranking |
+----------+-------+---------+
| Person A | X | 1 |
| Person A | Y | 2 |
| Person A | X | 3 |
| … | | |
+----------+-------+---------+
(group all the similar entries into 1 - the same group can repeat after another group like in the example above - the groups are X > Y > X)
I have hundreds of Persons and ~20 million records. I tried running a for loop but that just takes way too much time.
Please let me know if there is an easier way to achieve this.
Any help is appreciated. Thanks in advance.
Upvotes: 0
Views: 89
Reputation: 145765
Here's a data.table solution, should be quite fast.
library(data.table)
dt[, .(Ranking = rleid(Group), Group), by = .(Person)][, .SD[1], by = .(Ranking, Person)]
# Person Ranking Group
# 1: Person A 1 X
# 2: Person A 2 Y
# 3: Person A 3 X
(Original method didn't calculate the rleid for each person separately, edited to fix.)
A different method. Not sure if this will be any faster, but we could conceptualize the problem as keeping rows where either the Person or the Group is different from the previous row, then numbering them by group:
dt[is.na(shift(Person)) | shift(Person) != Person | shift(Group) != Group, .(Person, Group)][, Ranking := 1:.N, by = .(Person)][]
# Person Group Ranking
# 1: Person A X 1
# 2: Person A Y 2
# 3: Person A X 3
Using this data:
dt = fread(" Person | Group | Timestamp
Person A | X | 12:00 PM
Person A | X | 12:01 PM
Person A | X | 12:03 PM
Person A | Y | 12:10 PM
Person A | Y | 12:11 PM
Person A | Y | 12:12 PM
Person A | X | 12:20 PM
Person A | X | 12:21 PM
Person A | X | 12:22 PM", sep = "|")
Upvotes: 3
Reputation: 32548
library(dplyr)
library(tidyr)
d %>%
group_by(Person) %>%
mutate(Ranking = sequence(rle(Group)$lengths) == 1) %>%
ungroup() %>%
select(-Timestamp) %>%
filter(Ranking) %>%
mutate(Ranking = cumsum(Ranking))
## A tibble: 3 x 3
# Person Group Ranking
# <chr> <chr> <int>
#1 Person A X 1
#2 Person A Y 2
#3 Person A X 3
In Base R
do.call(rbind, lapply(split(d, d$Person), function(x){
data.frame(Person = x$Person[1],
with(rle(x$Group),
data.frame(Group = values,
Ranking = seq_along(values))))}))
DATA
d = structure(list(Person = c("Person A", "Person A", "Person A",
"Person A", "Person A", "Person A",
"Person A", "Person A", "Person A"),
Group = c("X", "X", "X", "Y", "Y", "Y", "X", "X", "X"),
Timestamp = c("12:00 PM", "12:01 PM", "12:03 PM", "12:10 PM",
"12:11 PM", "12:12 PM", "12:20 PM", "12:21 PM",
"12:22 PM")),
class = "data.frame",
row.names = c(NA, -9L))
Upvotes: 1
Reputation: 3943
Here is a tidyverse solution that ensures the timestamps are sorted in ascending order within Person before returning the rankings.
library(tidyverse)
get_ranking <- function(data) {
grps <- rle(data$Group)$values
data.frame(Group = grps, Ranking = seq_along(grps))
}
dat %>%
group_by(Person) %>%
arrange(Timestamp) %>%
group_modify(~ get_ranking(.x))
Using this data:
dat <- data.frame(Person= 'Person A',
Group=rep(c('X','Y','X'),each=3),
Timestamp=as.POSIXct('2010-01-01 12:00 PM')+(1:9)*60,
stringsAsFactors = FALSE)
To produce this output:
# A tibble: 3 x 3
# Groups: Person [1]
Person Group Ranking
<chr> <fct> <int>
1 Person A X 1
2 Person A Y 2
3 Person A X 3
Upvotes: 1