Reputation: 593
I have two data manipulation requests for the table below.
I want to:
Duplicate each sub-work record and flag them as start and end within a new variable named status. To do this, timeStamp of the following sub-work must be coded as the ending time of the previous sub-work. For each work, the start and end timestamps of the last sub-work will be the same since there is no following sub-work.
Create a column named subWorkInstanceID that indicates the order of sub-works within each different workID.
Note: The original table has millions of records; so, I appreciate a fast solution if possible.
Thanks in advance.
Original format:
Desired output:
Create sample table:
dt <- read.table(text = "workID,subWorkID,timeStamp w1,a,2015-01-08 13:27:14 w1,b,2015-01-08 15:45:43 w1,c,2015-01-08 15:53:36 w1,a,2015-01-08 16:15:08 w2,a,2015-04-13 13:34:33 w2,b,2015-04-13 13:36:13 w2,k,2015-04-13 13:39:20", sep = ",", header = T, stringsAsFactors = FALSE, colClasses = c("character", "character", "POSIXct") )
Edit1: Tables are updated; now, they are align with the data provided in the code above.
Edit2: During the test of the answers, I realized that, in the original data, there are subWorkIDs those are being revisited(repeated) again within the same workID. To reflect this situation to the sample table, I changed the value of subWorkID on the fourth row from 'e' to 'a'. Also updated the desired output. Could you consider these repetitions as a new subWorkInstanceIDs? I should have realized this before; sorry for causing extra work! Thanks again.
Upvotes: 1
Views: 119
Reputation: 39154
A solution using tidyverse and rleid
from data.table.
library(tidyverse)
library(data.table)
dt2 <- dt %>%
# Repeat each row by two
slice(rep(row_number(), each = 2)) %>%
group_by(workID) %>%
# Move the timestamp column up by one element
mutate(timeStamp = lead(timeStamp),
# Create run length ID based on subWorkID
subWorkInstanceID = rleid(subWorkID)) %>%
# Fill the NA in timestamp by the previous non-NA value
fill(timeStamp) %>%
ungroup() %>%
# Create the status column
mutate(status = rep(c("start", "end"), n()/2)) %>%
# Reorder the columns
select(names(dt), timeStamp, status, subWorkInstanceID)
dt2
# # A tibble: 14 x 5
# workID subWorkID timeStamp status subWorkInstanceID
# <chr> <chr> <dttm> <chr> <int>
# 1 w1 a 2015-01-08 13:27:14 start 1
# 2 w1 a 2015-01-08 15:45:43 end 1
# 3 w1 b 2015-01-08 15:45:43 start 2
# 4 w1 b 2015-01-08 15:53:36 end 2
# 5 w1 c 2015-01-08 15:53:36 start 3
# 6 w1 c 2015-01-08 16:15:08 end 3
# 7 w1 a 2015-01-08 16:15:08 start 4
# 8 w1 a 2015-01-08 16:15:08 end 4
# 9 w2 a 2015-04-13 13:34:33 start 1
# 10 w2 a 2015-04-13 13:36:13 end 1
# 11 w2 b 2015-04-13 13:36:13 start 2
# 12 w2 b 2015-04-13 13:39:20 end 2
# 13 w2 k 2015-04-13 13:39:20 start 3
# 14 w2 k 2015-04-13 13:39:20 end 3
DATA
structure(list(workID = c("w1", "w1", "w1", "w1", "w2", "w2",
"w2"), subWorkID = c("a", "b", "c", "a", "a", "b", "k"), timeStamp = structure(c(1420741634,
1420749943, 1420750416, 1420751708, 1428946473, 1428946573, 1428946760
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("workID",
"subWorkID", "timeStamp"), class = "data.frame", row.names = c(NA,
-7L))
Upvotes: 1
Reputation: 79288
You can use data.table
library(data.table)
setDT(dt)[,c(s<-cbind(.SD,subWorkInstanceID=1:.N)[rep(1:.N,each=2)],
status=list(rep(c("start","end"),length=nrow(s))),
timestamp=shift(s[,"timeStamp"],,s[.N,"timeStamp"],"lead")),
by=workID][,-3]
workID subWorkID subWorkInstanceID status timestamp
1: w1 a 1 start 2015-01-08 13:27:14
2: w1 a 1 end 2015-01-08 15:45:43
3: w1 b 2 start 2015-01-08 15:45:43
4: w1 b 2 end 2015-01-08 15:53:36
5: w1 c 3 start 2015-01-08 15:53:36
6: w1 c 3 end 2015-01-08 16:15:08
7: w1 e 4 start 2015-01-08 16:15:08
8: w1 e 4 end 2015-01-08 16:15:08
9: w2 a 1 start 2015-04-13 13:34:33
10: w2 a 1 end 2015-04-13 13:36:13
11: w2 b 2 start 2015-04-13 13:36:13
12: w2 b 2 end 2015-04-13 13:39:20
13: w2 k 3 start 2015-04-13 13:39:20
14: w2 k 3 end 2015-04-13 13:39:20
Upvotes: 0
Reputation: 25225
Here is a solution using data.table. Explanation inline.
library(data.table)
setDT(dt)
#create a end time and subWorkInstanceID
wideDT <- dt[, list(subWorkID=subWorkID,
subWorkInstanceID=seq_len(.N),
start=timeStamp,
end=shift(timeStamp, fill=timeStamp[.N], type="lead")),
by=.(workID)]
#melt into OP's desired long format
res <- melt(wideDT, measure.vars=c("start", "end"), variable.name="status", value.name="timeStamp")
setorder(res, workID, subWorkID, subWorkInstanceID)
res
# workID subWorkID subWorkInstanceID status timeStamp
# 1: w1 a 1 start 2015-01-08 13:27:14
# 2: w1 a 1 end 2015-01-08 15:45:43
# 3: w1 b 2 start 2015-01-08 15:45:43
# 4: w1 b 2 end 2015-01-08 15:53:36
# 5: w1 c 3 start 2015-01-08 15:53:36
# 6: w1 c 3 end 2015-01-08 16:15:08
# 7: w1 e 4 start 2015-01-08 16:15:08
# 8: w1 e 4 end 2015-01-08 16:15:08
# 9: w2 a 1 start 2015-04-13 13:34:33
#10: w2 a 1 end 2015-04-13 13:36:13
#11: w2 b 2 start 2015-04-13 13:36:13
#12: w2 b 2 end 2015-04-13 13:39:20
#13: w2 k 3 start 2015-04-13 13:39:20
#14: w2 k 3 end 2015-04-13 13:39:20
Upvotes: 1