Reputation: 331
I have a long format table data containing around 1.2 million rows in the below format
+-----------+-------+ | ProcessId | Event | +-----------+-------+ | 111 | A | | 111 | B | | 111 | C | | 111 | D | | 111 | E | | 112 | A | | 112 | B | | 112 | D | | 112 | E | | 113 | A | | 113 | B | | 113 | C | | 113 | D | | 113 | E | +-----------+-------+
The events are already sorted by their time-stamp and I want to generate the process flow string for each process id from the long format table into a separate column to another table like below
processflow <- ('A->B->C->D->E' , 'A->B->D->E' ,'A->B->C->D->E' ,..etc)
I tried the below lines of code. but each of them is taking almost 850 seconds in my core i3, 4GB RAM laptop. what are the alternative ways to do this faster ? Any guidance will be helpful
Version 1 using dplyr package
state_flow_identifier <- function(x){
y <-filter(data_longformat,process_id==x)
y_v <- as.vector(y$Event)
as.character(paste(y_v,collapse = '->'))
}
processflow_detection <- mutate(processflow_detection,processflow = apply(processflow_detection['process_id'],1,function(x)state_flow_identifier(x)))
Version 2 using foreach package
processflow <- foreach(i=processflow_detection$process_id,.combine='c') %do% state_flow_identifier(i)
Thanks.
Upvotes: 1
Views: 80
Reputation: 51592
You should also check out data.table
for efficiency, (Using @Gopala's data frame)
library(data.table)
setDT(df)[, .(event_flow = paste(Event, collapse = '->')), by = ProcessId]
# ProcessId event_flow
#1: 111 A->B->C->D->E
#2: 112 A->B->C->D->E
#3: 113 A->B->C->D->E
Upvotes: 2
Reputation: 6737
You don't need any libraries.
df <- data.frame(ProcessId=c(rep(111,5),rep(112,4),rep(113,5)),Event=unlist(strsplit("ABCDEABDEABCDE","")))
len <- nrow(df)
new_process_id <- c(TRUE, df$ProcessId[-len] != df$ProcessId[-1])
sym <- rep(" -> ", len)
sym[new_process_id] <- ","
str <- paste(sym, df$Event, sep="", collapse="")
data.frame(ProcessId=rle(df$ProcessId)$val,EventFlow=unlist(strsplit(str,","))[-1])
Output:
ProcessId EventFlow
1 111 A -> B -> C -> D -> E
2 112 A -> B -> D -> E
3 113 A -> B -> C -> D -> E
For 1.25 mln lines, the execution time was about 2 seconds on my laptop
Upvotes: 0
Reputation: 10483
You can try something like this with dplyr
and it's group_by
function:
library(dplyr)
df %>%
group_by(ProcessId) %>%
summarise(eventFlow = paste(Event, collapse = ' -> '))
Output will be as follows:
Source: local data frame [3 x 2]
ProcessId eventFlow
(dbl) (chr)
1 111 A -> B -> C -> D -> E
2 112 A -> B -> C -> D -> E
3 113 A -> B -> C -> D -> E
Note, I used this data frame as input since your's was not in a format to input into R:
df <- data.frame(ProcessId = c(rep(111, 5), rep(112, 5), rep(113, 5)), Event = rep(c('A', 'B', 'C', 'D', 'E'), 3))
df
ProcessId Event
1 111 A
2 111 B
3 111 C
4 111 D
5 111 E
6 112 A
7 112 B
8 112 C
9 112 D
10 112 E
11 113 A
12 113 B
13 113 C
14 113 D
15 113 E
Upvotes: 1