Reputation: 593

r - duplicate records to indicate start and end times and flag them in a new column acordingly

I have two data manipulation requests for the table below.

I want to:

Duplicate each sub-work record and flag them as start and end within a new variable named status. To do this, timeStamp of the following sub-work must be coded as the ending time of the previous sub-work. For each work, the start and end timestamps of the last sub-work will be the same since there is no following sub-work.
Create a column named subWorkInstanceID that indicates the order of sub-works within each different workID.

Note: The original table has millions of records; so, I appreciate a fast solution if possible.

Thanks in advance.

Original format:

Desired output:

Create sample table:

dt <- read.table(text = "workID,subWorkID,timeStamp
w1,a,2015-01-08 13:27:14
w1,b,2015-01-08 15:45:43
w1,c,2015-01-08 15:53:36
w1,a,2015-01-08 16:15:08
w2,a,2015-04-13 13:34:33
w2,b,2015-04-13 13:36:13
w2,k,2015-04-13 13:39:20",
                 sep = ",",
                 header = T,
                 stringsAsFactors = FALSE,
                 colClasses = c("character", "character", "POSIXct")
                 )

Edit1: Tables are updated; now, they are align with the data provided in the code above.

Edit2: During the test of the answers, I realized that, in the original data, there are subWorkIDs those are being revisited(repeated) again within the same workID. To reflect this situation to the sample table, I changed the value of subWorkID on the fourth row from 'e' to 'a'. Also updated the desired output. Could you consider these repetitions as a new subWorkInstanceIDs? I should have realized this before; sorry for causing extra work! Thanks again.

Upvotes: 1

Answers (3)

www

Reputation: 39154

A solution using tidyverse and rleid from data.table.

library(tidyverse)
library(data.table)

dt2 <- dt %>%
  # Repeat each row by two
  slice(rep(row_number(), each = 2)) %>%
  group_by(workID) %>%
  # Move the timestamp column up by one element
  mutate(timeStamp = lead(timeStamp),
         # Create run length ID based on subWorkID
         subWorkInstanceID = rleid(subWorkID)) %>%
  # Fill the NA in timestamp by the previous non-NA value
  fill(timeStamp) %>%
  ungroup() %>%
  # Create the status column
  mutate(status = rep(c("start", "end"), n()/2)) %>%
  # Reorder the columns
  select(names(dt), timeStamp, status, subWorkInstanceID)
dt2
# # A tibble: 14 x 5
#    workID subWorkID timeStamp           status subWorkInstanceID
#    <chr>  <chr>     <dttm>              <chr>              <int>
#  1 w1     a         2015-01-08 13:27:14 start                  1
#  2 w1     a         2015-01-08 15:45:43 end                    1
#  3 w1     b         2015-01-08 15:45:43 start                  2
#  4 w1     b         2015-01-08 15:53:36 end                    2
#  5 w1     c         2015-01-08 15:53:36 start                  3
#  6 w1     c         2015-01-08 16:15:08 end                    3
#  7 w1     a         2015-01-08 16:15:08 start                  4
#  8 w1     a         2015-01-08 16:15:08 end                    4
#  9 w2     a         2015-04-13 13:34:33 start                  1
# 10 w2     a         2015-04-13 13:36:13 end                    1
# 11 w2     b         2015-04-13 13:36:13 start                  2
# 12 w2     b         2015-04-13 13:39:20 end                    2
# 13 w2     k         2015-04-13 13:39:20 start                  3
# 14 w2     k         2015-04-13 13:39:20 end                    3

DATA

structure(list(workID = c("w1", "w1", "w1", "w1", "w2", "w2", 
"w2"), subWorkID = c("a", "b", "c", "a", "a", "b", "k"), timeStamp = structure(c(1420741634, 
1420749943, 1420750416, 1420751708, 1428946473, 1428946573, 1428946760
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("workID", 
"subWorkID", "timeStamp"), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 1

Onyambu

Reputation: 79288

You can use data.table

library(data.table)
setDT(dt)[,c(s<-cbind(.SD,subWorkInstanceID=1:.N)[rep(1:.N,each=2)],
          status=list(rep(c("start","end"),length=nrow(s))),
          timestamp=shift(s[,"timeStamp"],,s[.N,"timeStamp"],"lead")),
          by=workID][,-3]


   workID subWorkID subWorkInstanceID status           timestamp
 1:     w1         a                 1  start 2015-01-08 13:27:14
 2:     w1         a                 1    end 2015-01-08 15:45:43
 3:     w1         b                 2  start 2015-01-08 15:45:43
 4:     w1         b                 2    end 2015-01-08 15:53:36
 5:     w1         c                 3  start 2015-01-08 15:53:36
 6:     w1         c                 3    end 2015-01-08 16:15:08
 7:     w1         e                 4  start 2015-01-08 16:15:08
 8:     w1         e                 4    end 2015-01-08 16:15:08
 9:     w2         a                 1  start 2015-04-13 13:34:33
10:     w2         a                 1    end 2015-04-13 13:36:13
11:     w2         b                 2  start 2015-04-13 13:36:13
12:     w2         b                 2    end 2015-04-13 13:39:20
13:     w2         k                 3  start 2015-04-13 13:39:20
14:     w2         k                 3    end 2015-04-13 13:39:20

Upvotes: 0

chinsoon12

Reputation: 25225

Here is a solution using data.table. Explanation inline.

library(data.table)
setDT(dt)

#create a end time and subWorkInstanceID
wideDT <- dt[, list(subWorkID=subWorkID,
        subWorkInstanceID=seq_len(.N),
        start=timeStamp,
        end=shift(timeStamp, fill=timeStamp[.N], type="lead")), 
    by=.(workID)]

#melt into OP's desired long format
res <- melt(wideDT, measure.vars=c("start", "end"), variable.name="status", value.name="timeStamp")
setorder(res, workID, subWorkID, subWorkInstanceID)
res

#    workID subWorkID subWorkInstanceID status           timeStamp
# 1:     w1         a                 1  start 2015-01-08 13:27:14
# 2:     w1         a                 1    end 2015-01-08 15:45:43
# 3:     w1         b                 2  start 2015-01-08 15:45:43
# 4:     w1         b                 2    end 2015-01-08 15:53:36
# 5:     w1         c                 3  start 2015-01-08 15:53:36
# 6:     w1         c                 3    end 2015-01-08 16:15:08
# 7:     w1         e                 4  start 2015-01-08 16:15:08
# 8:     w1         e                 4    end 2015-01-08 16:15:08
# 9:     w2         a                 1  start 2015-04-13 13:34:33
#10:     w2         a                 1    end 2015-04-13 13:36:13
#11:     w2         b                 2  start 2015-04-13 13:36:13
#12:     w2         b                 2    end 2015-04-13 13:39:20
#13:     w2         k                 3  start 2015-04-13 13:39:20
#14:     w2         k                 3    end 2015-04-13 13:39:20

Upvotes: 1

r - duplicate records to indicate start and end times and flag them in a new column acordingly

Answers (3)

Related Questions