FG7
FG7

Reputation: 469

R, data.table: programatically add column using calculated info and name from another data table

I have two data tables as follows:

set.seed(1)
landing_fees <- data.table(`date`= c(20160101,20160101,20160101,
            20160102,20160102,20160102,20160103,20160103,20160103), 
            airport=c("LAX","PHX","HNL"), period_a_fee=sample(100:200,9),
            period_b_fee=sample(100:200,9), period_c_fee=sample(100:200,9))
routes <- data.table(origin=c(rep("LAX",4),rep("PHX",4)), 
            destination=c("PHX","HNL","PHX","HNL","LAX","HNL","LAX","HNL"),
            period=rep(c("period_a","period_a","period_b","period_c"),2))

Which output as follows:

> landing_fees
       date airport period_a_fee period_b_fee period_c_fee
1: 20160101     LAX          126          106          138
2: 20160101     PHX          137          120          177
3: 20160101     HNL          156          117          192
4: 20160102     LAX          189          167          120
5: 20160102     PHX          119          137          163
6: 20160102     HNL          186          173          112
7: 20160103     LAX          197          147          125
8: 20160103     PHX          162          197          136
9: 20160103     HNL          158          192          101
> routes
   origin destination   period
1:    LAX         PHX period_a
2:    LAX         HNL period_a
3:    LAX         PHX period_b
4:    LAX         HNL period_c
5:    PHX         LAX period_a
6:    PHX         HNL period_a
7:    PHX         LAX period_b
8:    PHX         HNL period_c

The real datasets are huge with millions of origin/destination combinations and years worth of dates. I would like to add a new column to the routes table for each unique date in landing_fees$date and programatically name the column with the date and _cost (ie 20160101_cost). Then in each new column, sum the fee for origin and the fee for the destination (using matching time period) as listed in the landing_fees table. For instance, in the routes table, a new column routes$20160101_cost would be created and the value of row 1 would be: LAX period_a fee of 126 (from landing_fees table) plus PHX period_a fee of 137 (from landing_fees table) equals 263. Same calculation would be made for each origin/destination/period combination for each date.

Final output would look like this:

    origin  destination period  20160101_cost 20160102_cost 20160103_cost
1:  LAX          PHX    period_a    263          308            359
2:  LAX          HNL    period_a    282          375            355
3:  LAX          PHX    period_b    226          304            344
4:  LAX          HNL    period_c    330          232            226
5:  PHX          LAX    period_a    263          308            359
6:  PHX          HNL    period_a    293          305            320
7:  PHX          LAX    period_b    226          304            344
8:  PHX          HNL    period_c    369          275            237

Since the datasets can be fairly large, I am hoping to use efficient data.table script. I am using data.table version 1.9.6 on Windows 7 machine.

Upvotes: 0

Views: 81

Answers (1)

Chris
Chris

Reputation: 6372

You can do this using melt and cast from data.table.

This flattens your cost matrix for merging:

lf_flat <- melt(landing_fees, id.vars = c("date", "airport"), variable.name = "period")[, period := gsub("_fee", "", period)]

Then merge to the routing table, and collapse cost

DT <- merge(routes, lf_flat,by.x = c("origin","period"), by.y = c("airport","period"))
DT <- merge(DT, lf_flat, by.x = c("destination","period","date"), by.y = c("airport","period","date"))
DT[, cost := value.x + value.y][,(c("value.x","value.y")) := NULL]

Which gives you the consolidated table:

DT
    destination   period     date origin cost
 1:         HNL period_a 20160101    LAX  282
 2:         HNL period_a 20160101    PHX  293
 3:         HNL period_a 20160102    LAX  375
 4:         HNL period_a 20160102    PHX  305
 5:         HNL period_a 20160103    LAX  355
 6:         HNL period_a 20160103    PHX  320
 7:         HNL period_c 20160101    LAX  330
 8:         HNL period_c 20160101    PHX  369
 9:         HNL period_c 20160102    LAX  232
10:         HNL period_c 20160102    PHX  275
11:         HNL period_c 20160103    LAX  226
12:         HNL period_c 20160103    PHX  237
13:         LAX period_a 20160101    PHX  263
14:         LAX period_a 20160102    PHX  308
15:         LAX period_a 20160103    PHX  359
16:         LAX period_b 20160101    PHX  226
17:         LAX period_b 20160102    PHX  304
18:         LAX period_b 20160103    PHX  344
19:         PHX period_a 20160101    LAX  263
20:         PHX period_a 20160102    LAX  308
21:         PHX period_a 20160103    LAX  359
22:         PHX period_b 20160101    LAX  226
23:         PHX period_b 20160102    LAX  304
24:         PHX period_b 20160103    LAX  344
    destination   period     date origin cost

I would personally leave it like this, as it gives you the same data as your desired matrix, but with more flexibility. To get your desired state, use:

dcast(DT, origin + destination + period ~ date, value.var = "cost")
   origin destination   period 20160101 20160102 20160103
1:    LAX         HNL period_a      282      375      355
2:    LAX         HNL period_c      330      232      226
3:    LAX         PHX period_a      263      308      359
4:    LAX         PHX period_b      226      304      344
5:    PHX         HNL period_a      293      305      320
6:    PHX         HNL period_c      369      275      237
7:    PHX         LAX period_a      263      308      359
8:    PHX         LAX period_b      226      304      344

Upvotes: 3

Related Questions