User981636
User981636

Reputation: 3621

dcast efficiently large datasets with multiple variables

I am trying to dcast a large dataset (millions of rows). I have one row for arrival time and origin, and another row for departure time and destination. There is an id to identify the unit in both cases. It looks similar to this:

id  time            movement    origin  dest
1   10/06/2011 15:54    ARR        15    15
1   10/06/2011 16:14    DEP        15    29
2   10/06/2011 17:59    ARR        73    73
2   10/06/2011 18:10    DEP        73    75
2   10/06/2011 21:10    ARR        75    75
2   10/06/2011 21:20    DEP        75    73
3   10/06/2011 17:14    ARR        17    17
3   10/06/2011 18:01    DEP        17    48
4   10/06/2011 17:14    ARR        49    49
4   10/06/2011 17:26    DEP        49    15

So, I would like to reallocate the pairs (ARR-DEP) and do this efficiently (as here). As it is a very large dataset a for loop wouldn't work in this case. The ideal output would be

  index unitid origin   arr time    dest    dep time
    1    1     15   10/06/2011 14:33    29  10/06/2011 19:24
    2    2     73   10/06/2011 14:59    75  10/06/2011 17:23
    3    2     75   10/06/2011 21:10    73  10/06/2011 23:40

Data:

        df <- structure(list(time = structure(c(7L, 16L, 8L, 11L, 18L, 20L, 
10L, 12L, 3L, 6L, 15L, 19L, 9L, 4L, 5L, 14L, 1L, 2L, 13L, 17L
), .Label = c("10/06/2011 09:08", "10/06/2011 10:54", "10/06/2011 11:38", 
"10/06/2011 12:41", "10/06/2011 12:54", "10/06/2011 14:26", "10/06/2011 14:33", 
"10/06/2011 14:59", "10/06/2011 17:12", "10/06/2011 17:14", "10/06/2011 17:23", 
"10/06/2011 18:56", "10/06/2011 19:03", "10/06/2011 19:04", "10/06/2011 19:16", 
"10/06/2011 19:24", "10/06/2011 20:12", "10/06/2011 21:10", "10/06/2011 22:28", 
"10/06/2011 23:40"), class = "factor"), movement = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 
2L, 2L, 3L), .Label = c("ARR", "DEP", "ITZ"), class = "factor"), 
    origin = c(15L, 15L, 73L, 73L, 75L, 75L, 17L, 17L, 49L, 49L, 
    15L, 15L, 32L, 10L, 10L, 17L, 76L, 76L, 76L, 76L), dest = c(15L, 
    29L, 73L, 75L, 75L, 73L, 17L, 48L, 49L, 15L, 15L, 49L, 32L, 
    10L, 17L, 10L, 76L, 65L, 76L, 65L), id = c(1L, 1L, 2L, 2L, 
    2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 6L, 6L, 6L, 7L, 7L, 8L, 
    8L)), .Names = c("time", "movement", "origin", "dest", "id"
), row.names = c(NA, -20L), class = "data.frame")

Upvotes: 2

Views: 342

Answers (2)

Arun
Arun

Reputation: 118799

How about this? Using data.table:

require(data.table)
setorder(setDT(df), id, time)
df[, grp := FALSE][movement == "ARR", grp := TRUE]
df[, .(time[grp], time[!grp], origin[grp], dest[!grp]), by=id]
#    id                  V1                  V2 V3 V4
# 1:  1 10/06/2011 14:33:57 10/06/2011 19:24:16 15 29
# 2:  2 10/06/2011 14:59:14 10/06/2011 17:23:20 73 75
# 3:  2 10/06/2011 21:10:56 10/06/2011 23:40:29 75 73
# 4:  3 10/06/2011 17:14:44 10/06/2011 18:56:39 17 48
# 5:  4 10/06/2011 11:38:43 10/06/2011 14:26:43 49 15
# 6:  4 10/06/2011 19:16:55 10/06/2011 22:28:14 15 49
# 7:  5 10/06/2011 10:41:20 10/06/2011 12:54:26 10 17
# 8:  6 10/06/2011 09:08:05 10/06/2011 10:54:48 76 65

You can make this slightly faster if you add another column with value !grp and use that column instead of doing !grp on each group.


How this works:

  • setDT converts data.frame to data.table by reference.

  • setorder reorders a data.table by reference based on the columns (and the order) provided. Here, it reorders the rows of df in increasing order based on columns id and time.

  • Then we use data.table's sub-assign by reference to add an extra column which holds the value TRUE when movement == "ARR" and FALSE when movement == "DEP".

    Note: The factor levels in your df$movement column has an additional level called ITZ which doesn't seem to be in this sample data. Not sure how to handle that.

  • Now all we have to do is pick 1,3,5, .. elements from origin and 2,4,6,... elements from dest (and similarly for time).

This works as long as ARR time is always before DEP time (which is very much a valid assumption).


Following OP's edit to the Q with inconsistencies in data:

na.omit(df[movement != "ITZ", .(time[grp], time[!grp], origin[grp], dest[!grp]), by=id])

Upvotes: 3

konvas
konvas

Reputation: 14346

If your dataset is structured like in the example, i.e. there is one arrival and one departure time per id and origin, then you may be able to do this manually, just by re-ordering and subsetting your data (of course you have to be very very careful with this, and try to add as many checks as possible, like the one below, to catch errors)

dat <- df[order(df$id, df$origin, df$dest, df$movement), ]
dat.dep <- dat[dat$movement == "DEP", ]
dat.arr <- dat[dat$movement == "ARR", ]
stopifnot(nrow(dat.dep) == nrow(dat.arr) &
    dat.dep$origin == dat.arr$origin & 
    dat.dep$id == dat.arr$id)
result <- dat.dep[c("id", "origin", "dest")]
result$arr.time <- dat.arr$time
result$dep.time <- dat.dep$time
result 
#    id origin dest            arr.time            dep.time
# 2   1     15   29 10/06/2011 14:33:57 10/06/2011 19:24:16
# 4   2     73   75 10/06/2011 14:59:14 10/06/2011 17:23:20
# 6   2     75   73 10/06/2011 21:10:56 10/06/2011 23:40:29
# 8   3     17   48 10/06/2011 17:14:44 10/06/2011 18:56:39
# 12  4     15   49 10/06/2011 19:16:55 10/06/2011 22:28:14
# 10  4     49   15 10/06/2011 11:38:43 10/06/2011 14:26:43
# 14  5     10   17 10/06/2011 10:41:20 10/06/2011 12:54:26
# 16  6     76   65 10/06/2011 09:08:05 10/06/2011 10:54:48

Upvotes: 1

Related Questions