Reputation: 469
I have two data tables as follows:
set.seed(1)
landing_fees <- data.table(`date`= c(20160101,20160101,20160101,
20160102,20160102,20160102,20160103,20160103,20160103),
airport=c("LAX","PHX","HNL"), period_a_fee=sample(100:200,9),
period_b_fee=sample(100:200,9), period_c_fee=sample(100:200,9))
routes <- data.table(origin=c(rep("LAX",4),rep("PHX",4)),
destination=c("PHX","HNL","PHX","HNL","LAX","HNL","LAX","HNL"),
period=rep(c("period_a","period_a","period_b","period_c"),2))
Which output as follows:
> landing_fees
date airport period_a_fee period_b_fee period_c_fee
1: 20160101 LAX 126 106 138
2: 20160101 PHX 137 120 177
3: 20160101 HNL 156 117 192
4: 20160102 LAX 189 167 120
5: 20160102 PHX 119 137 163
6: 20160102 HNL 186 173 112
7: 20160103 LAX 197 147 125
8: 20160103 PHX 162 197 136
9: 20160103 HNL 158 192 101
> routes
origin destination period
1: LAX PHX period_a
2: LAX HNL period_a
3: LAX PHX period_b
4: LAX HNL period_c
5: PHX LAX period_a
6: PHX HNL period_a
7: PHX LAX period_b
8: PHX HNL period_c
The real datasets are huge with millions of origin/destination combinations and years worth of dates. I would like to add a new column to the routes table for each unique date in landing_fees$date and programatically name the column with the date and _cost (ie 20160101_cost). Then in each new column, sum the fee for origin and the fee for the destination (using matching time period) as listed in the landing_fees table. For instance, in the routes table, a new column routes$20160101_cost would be created and the value of row 1 would be: LAX period_a fee of 126 (from landing_fees table) plus PHX period_a fee of 137 (from landing_fees table) equals 263. Same calculation would be made for each origin/destination/period combination for each date.
Final output would look like this:
origin destination period 20160101_cost 20160102_cost 20160103_cost
1: LAX PHX period_a 263 308 359
2: LAX HNL period_a 282 375 355
3: LAX PHX period_b 226 304 344
4: LAX HNL period_c 330 232 226
5: PHX LAX period_a 263 308 359
6: PHX HNL period_a 293 305 320
7: PHX LAX period_b 226 304 344
8: PHX HNL period_c 369 275 237
Since the datasets can be fairly large, I am hoping to use efficient data.table script. I am using data.table version 1.9.6 on Windows 7 machine.
Upvotes: 0
Views: 81
Reputation: 6372
You can do this using melt and cast from data.table
.
This flattens your cost matrix for merging:
lf_flat <- melt(landing_fees, id.vars = c("date", "airport"), variable.name = "period")[, period := gsub("_fee", "", period)]
Then merge to the routing table, and collapse cost
DT <- merge(routes, lf_flat,by.x = c("origin","period"), by.y = c("airport","period"))
DT <- merge(DT, lf_flat, by.x = c("destination","period","date"), by.y = c("airport","period","date"))
DT[, cost := value.x + value.y][,(c("value.x","value.y")) := NULL]
Which gives you the consolidated table:
DT
destination period date origin cost
1: HNL period_a 20160101 LAX 282
2: HNL period_a 20160101 PHX 293
3: HNL period_a 20160102 LAX 375
4: HNL period_a 20160102 PHX 305
5: HNL period_a 20160103 LAX 355
6: HNL period_a 20160103 PHX 320
7: HNL period_c 20160101 LAX 330
8: HNL period_c 20160101 PHX 369
9: HNL period_c 20160102 LAX 232
10: HNL period_c 20160102 PHX 275
11: HNL period_c 20160103 LAX 226
12: HNL period_c 20160103 PHX 237
13: LAX period_a 20160101 PHX 263
14: LAX period_a 20160102 PHX 308
15: LAX period_a 20160103 PHX 359
16: LAX period_b 20160101 PHX 226
17: LAX period_b 20160102 PHX 304
18: LAX period_b 20160103 PHX 344
19: PHX period_a 20160101 LAX 263
20: PHX period_a 20160102 LAX 308
21: PHX period_a 20160103 LAX 359
22: PHX period_b 20160101 LAX 226
23: PHX period_b 20160102 LAX 304
24: PHX period_b 20160103 LAX 344
destination period date origin cost
I would personally leave it like this, as it gives you the same data as your desired matrix, but with more flexibility. To get your desired state, use:
dcast(DT, origin + destination + period ~ date, value.var = "cost")
origin destination period 20160101 20160102 20160103
1: LAX HNL period_a 282 375 355
2: LAX HNL period_c 330 232 226
3: LAX PHX period_a 263 308 359
4: LAX PHX period_b 226 304 344
5: PHX HNL period_a 293 305 320
6: PHX HNL period_c 369 275 237
7: PHX LAX period_a 263 308 359
8: PHX LAX period_b 226 304 344
Upvotes: 3