Androviperian
Androviperian

Reputation: 331

R - Optimizing loop operation on long table format

I have a long format table data containing around 1.2 million rows in the below format

+-----------+-------+
| ProcessId | Event |
+-----------+-------+
|       111 | A     |
|       111 | B     |
|       111 | C     |
|       111 | D     |
|       111 | E     |
|       112 | A     |
|       112 | B     |
|       112 | D     |
|       112 | E     |
|       113 | A     |
|       113 | B     |
|       113 | C     |
|       113 | D     |
|       113 | E     |
+-----------+-------+

The events are already sorted by their time-stamp and I want to generate the process flow string for each process id from the long format table into a separate column to another table like below

processflow <- ('A->B->C->D->E' , 'A->B->D->E' ,'A->B->C->D->E' ,..etc)

I tried the below lines of code. but each of them is taking almost 850 seconds in my core i3, 4GB RAM laptop. what are the alternative ways to do this faster ? Any guidance will be helpful

Version 1 using dplyr package

state_flow_identifier <- function(x){
y <-filter(data_longformat,process_id==x)
y_v <- as.vector(y$Event)
as.character(paste(y_v,collapse = '->'))
}

processflow_detection <- mutate(processflow_detection,processflow = apply(processflow_detection['process_id'],1,function(x)state_flow_identifier(x)))

Version 2 using foreach package

processflow <- foreach(i=processflow_detection$process_id,.combine='c') %do% state_flow_identifier(i)

Thanks.

Upvotes: 1

Views: 80

Answers (3)

Sotos
Sotos

Reputation: 51592

You should also check out data.table for efficiency, (Using @Gopala's data frame)

library(data.table)
setDT(df)[, .(event_flow = paste(Event, collapse = '->')), by = ProcessId]
#   ProcessId    event_flow
#1:       111 A->B->C->D->E
#2:       112 A->B->C->D->E
#3:       113 A->B->C->D->E

Upvotes: 2

user31264
user31264

Reputation: 6737

You don't need any libraries.

df <- data.frame(ProcessId=c(rep(111,5),rep(112,4),rep(113,5)),Event=unlist(strsplit("ABCDEABDEABCDE","")))
len <- nrow(df)
new_process_id <- c(TRUE, df$ProcessId[-len] != df$ProcessId[-1])
sym <- rep(" -> ", len)
sym[new_process_id] <- ","
str <- paste(sym, df$Event, sep="", collapse="")
data.frame(ProcessId=rle(df$ProcessId)$val,EventFlow=unlist(strsplit(str,","))[-1])

Output:

  ProcessId             EventFlow
1       111 A -> B -> C -> D -> E
2       112      A -> B -> D -> E
3       113 A -> B -> C -> D -> E

For 1.25 mln lines, the execution time was about 2 seconds on my laptop

Upvotes: 0

Gopala
Gopala

Reputation: 10483

You can try something like this with dplyr and it's group_by function:

library(dplyr)
df %>%
  group_by(ProcessId) %>%
  summarise(eventFlow = paste(Event, collapse = ' -> '))

Output will be as follows:

Source: local data frame [3 x 2]

  ProcessId             eventFlow
      (dbl)                 (chr)
1       111 A -> B -> C -> D -> E
2       112 A -> B -> C -> D -> E
3       113 A -> B -> C -> D -> E

Note, I used this data frame as input since your's was not in a format to input into R:

df <- data.frame(ProcessId = c(rep(111, 5), rep(112, 5), rep(113, 5)), Event = rep(c('A', 'B', 'C', 'D', 'E'), 3))
df
   ProcessId Event
1        111     A
2        111     B
3        111     C
4        111     D
5        111     E
6        112     A
7        112     B
8        112     C
9        112     D
10       112     E
11       113     A
12       113     B
13       113     C
14       113     D
15       113     E

Upvotes: 1

Related Questions