Reputation: 3621
I am trying to dcast
a large dataset (millions of rows). I have one row for arrival time and origin, and another row for departure time and destination. There is an id
to identify the unit in both cases. It looks similar to this:
id time movement origin dest
1 10/06/2011 15:54 ARR 15 15
1 10/06/2011 16:14 DEP 15 29
2 10/06/2011 17:59 ARR 73 73
2 10/06/2011 18:10 DEP 73 75
2 10/06/2011 21:10 ARR 75 75
2 10/06/2011 21:20 DEP 75 73
3 10/06/2011 17:14 ARR 17 17
3 10/06/2011 18:01 DEP 17 48
4 10/06/2011 17:14 ARR 49 49
4 10/06/2011 17:26 DEP 49 15
So, I would like to reallocate the pairs (ARR
-DEP
) and do this efficiently (as here). As it is a very large dataset a for loop
wouldn't work in this case. The ideal output would be
index unitid origin arr time dest dep time
1 1 15 10/06/2011 14:33 29 10/06/2011 19:24
2 2 73 10/06/2011 14:59 75 10/06/2011 17:23
3 2 75 10/06/2011 21:10 73 10/06/2011 23:40
Data:
df <- structure(list(time = structure(c(7L, 16L, 8L, 11L, 18L, 20L,
10L, 12L, 3L, 6L, 15L, 19L, 9L, 4L, 5L, 14L, 1L, 2L, 13L, 17L
), .Label = c("10/06/2011 09:08", "10/06/2011 10:54", "10/06/2011 11:38",
"10/06/2011 12:41", "10/06/2011 12:54", "10/06/2011 14:26", "10/06/2011 14:33",
"10/06/2011 14:59", "10/06/2011 17:12", "10/06/2011 17:14", "10/06/2011 17:23",
"10/06/2011 18:56", "10/06/2011 19:03", "10/06/2011 19:04", "10/06/2011 19:16",
"10/06/2011 19:24", "10/06/2011 20:12", "10/06/2011 21:10", "10/06/2011 22:28",
"10/06/2011 23:40"), class = "factor"), movement = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L,
2L, 2L, 3L), .Label = c("ARR", "DEP", "ITZ"), class = "factor"),
origin = c(15L, 15L, 73L, 73L, 75L, 75L, 17L, 17L, 49L, 49L,
15L, 15L, 32L, 10L, 10L, 17L, 76L, 76L, 76L, 76L), dest = c(15L,
29L, 73L, 75L, 75L, 73L, 17L, 48L, 49L, 15L, 15L, 49L, 32L,
10L, 17L, 10L, 76L, 65L, 76L, 65L), id = c(1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 6L, 6L, 6L, 7L, 7L, 8L,
8L)), .Names = c("time", "movement", "origin", "dest", "id"
), row.names = c(NA, -20L), class = "data.frame")
Upvotes: 2
Views: 342
Reputation: 118799
How about this? Using data.table
:
require(data.table)
setorder(setDT(df), id, time)
df[, grp := FALSE][movement == "ARR", grp := TRUE]
df[, .(time[grp], time[!grp], origin[grp], dest[!grp]), by=id]
# id V1 V2 V3 V4
# 1: 1 10/06/2011 14:33:57 10/06/2011 19:24:16 15 29
# 2: 2 10/06/2011 14:59:14 10/06/2011 17:23:20 73 75
# 3: 2 10/06/2011 21:10:56 10/06/2011 23:40:29 75 73
# 4: 3 10/06/2011 17:14:44 10/06/2011 18:56:39 17 48
# 5: 4 10/06/2011 11:38:43 10/06/2011 14:26:43 49 15
# 6: 4 10/06/2011 19:16:55 10/06/2011 22:28:14 15 49
# 7: 5 10/06/2011 10:41:20 10/06/2011 12:54:26 10 17
# 8: 6 10/06/2011 09:08:05 10/06/2011 10:54:48 76 65
You can make this slightly faster if you add another column with value !grp
and use that column instead of doing !grp
on each group.
How this works:
setDT
converts data.frame to data.table by reference.
setorder
reorders a data.table by reference based on the columns (and the order) provided. Here, it reorders the rows of df
in increasing order based on columns id
and time
.
Then we use data.table's sub-assign by reference to add an extra column which holds the value TRUE
when movement == "ARR"
and FALSE
when movement == "DEP"
.
Note: The factor levels in your df$movement
column has an additional level called ITZ
which doesn't seem to be in this sample data. Not sure how to handle that.
Now all we have to do is pick 1,3,5, .. elements from origin
and 2,4,6,... elements from dest
(and similarly for time
).
This works as long as ARR
time is always before DEP
time (which is very much a valid assumption).
Following OP's edit to the Q with inconsistencies in data:
na.omit(df[movement != "ITZ", .(time[grp], time[!grp], origin[grp], dest[!grp]), by=id])
Upvotes: 3
Reputation: 14346
If your dataset is structured like in the example, i.e. there is one arrival and one departure time per id and origin, then you may be able to do this manually, just by re-ordering and subsetting your data (of course you have to be very very careful with this, and try to add as many checks as possible, like the one below, to catch errors)
dat <- df[order(df$id, df$origin, df$dest, df$movement), ]
dat.dep <- dat[dat$movement == "DEP", ]
dat.arr <- dat[dat$movement == "ARR", ]
stopifnot(nrow(dat.dep) == nrow(dat.arr) &
dat.dep$origin == dat.arr$origin &
dat.dep$id == dat.arr$id)
result <- dat.dep[c("id", "origin", "dest")]
result$arr.time <- dat.arr$time
result$dep.time <- dat.dep$time
result
# id origin dest arr.time dep.time
# 2 1 15 29 10/06/2011 14:33:57 10/06/2011 19:24:16
# 4 2 73 75 10/06/2011 14:59:14 10/06/2011 17:23:20
# 6 2 75 73 10/06/2011 21:10:56 10/06/2011 23:40:29
# 8 3 17 48 10/06/2011 17:14:44 10/06/2011 18:56:39
# 12 4 15 49 10/06/2011 19:16:55 10/06/2011 22:28:14
# 10 4 49 15 10/06/2011 11:38:43 10/06/2011 14:26:43
# 14 5 10 17 10/06/2011 10:41:20 10/06/2011 12:54:26
# 16 6 76 65 10/06/2011 09:08:05 10/06/2011 10:54:48
Upvotes: 1