Reputation: 103
This question is based on answers of question here: Join based on a concatenated 'origin-destination' variable and consecutive individual sites
I have a for loop that needs to iterate over +100 thousand rows, and performs too slow. The main task is to iterate over each row of pairs
and find the Vonat_ID variable in STATIONS
dataframe, then with the match
function find the stations in STATIONS
that go from HONNAN variable to HOVA variable from the iterating row in pairs
.
This is the pairs dataframe. (167 thousand rows)
head(pairs)
# A tibble: 6 x 4
# Groups: Vonat_ID [2]
Vonat_ID HONNAN HOVA OD_pár
<chr> <chr> <chr> <chr>
1 20210216-40517 Hegyeshalom Hegyeshalom oh. Hegyeshalom -- Hegyeshalom oh.
2 20210216-40517 Biharkeresztes oh. Püspökladány Biharkeresztes oh. -- Püspökladány
3 20210216-40517 Püspökladány Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.
4 20210223-40517 Püspökladány Ferencváros Püspökladány -- Ferencváros
5 20210223-40517 Biharkeresztes oh. Püspökladány Biharkeresztes oh. -- Püspökladány
6 20210223-40517 Püspökladány Hegyeshalom oh. Püspökladány -- Hegyeshalom oh.
This is the STATIONS dataframe. (2.7 millions rows)
head(STATIONS)
# A tibble: 6 x 2
# Groups: Vonat_ID [1]
Vonat_ID Állomás
<chr> <chr>
1 20210216-40517 Biharkeresztes oh.
2 20210216-40517 Biharkeresztes
3 20210216-40517 Mezőpeterd
4 20210216-40517 Berettyóújfalu
5 20210216-40517 Sáp
6 20210216-40517 Báránd
...
20210216-40517 Öttevény
55 20210216-40517 Lébény-Mosonszentmiklós
56 20210216-40517 Kimle-Károlyháza
57 20210216-40517 Mosonmagyaróvár
58 20210216-40517 Hegyeshalom
59 20210216-40517 Hegyeshalom oh.
60 20210223-40517 Biharkeresztes oh.
61 20210223-40517 Biharkeresztes
62 20210223-40517 Mezőpeterd
63 20210223-40517 Berettyóújfalu
64 20210223-40517 Sáp
65 20210223-40517 Báránd
...
88 20210223-40517 Öttevény
89 20210223-40517 Lébény-Mosonszentmiklós
90 20210223-40517 Kimle-Károlyháza
91 20210223-40517 Mosonmagyaróvár
92 20210223-40517 Hegyeshalom
93 20210223-40517 Hegyeshalom oh.
...
I created vectors of each variable from pairs
to speed up the for loop. pairs_ID
, pairs_HN
, pairs_HV
and pairs_OD
, respectively.
The desired output looks like this: (stat is all the "Állomás" from STATIONS
between the "OD_pár" starting and endpoint. (HONNAN means FROM, HOVA means TO))
OD stat
1 Hegyeshalom -- Hegyeshalom oh. Hegyeshalom #"OD_pár" from the first row of `pairs`
2 Hegyeshalom -- Hegyeshalom oh. Hegyeshalom oh.
3 Biharkeresztes oh. -- Püspökladány Biharkeresztes oh.#"OD_pár" from the second row of `pairs`
4 Biharkeresztes oh. -- Püspökladány Biharkeresztes
5 Biharkeresztes oh. -- Püspökladány Mezőpeterd
6 Biharkeresztes oh. -- Püspökladány Berettyóújfalu
The for loop I am using is the following. I need a huge dataframe at the end with all the matching ODs and their station names, hence the list appending at the end of the loop. Also, I added an error catch function in case any appear.
bridge_total <- list()
for (x in 1:nrow(pairs)) {
tryCatch({
OD <- pairs_OD[x]
stat <- STATIONS[STATIONS$Vonat_ID==pairs_ID[x],][[2]][match(pairs_HN[x], STATIONS[STATIONS$Vonat_ID==pairs_ID[x],][[2]]) :
match(pairs_HV[x], STATIONS[STATIONS$Vonat_ID==pairs_ID[x],][[2]])]
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
df <- data.frame("OD"=OD, "stat"=stat)
bridge_total[[x]] <- df
print(x)
}
bridge_total <- do.call(rbind, bridge_total)
Is there a way I could speed up the loop even better? right now as of test runs, to iterate over +100k rows, it requires 2-3 hours minimum.
Upvotes: 0
Views: 72
Reputation: 11878
If I understood correctly, this small dataset should reproduce the setup:
library(data.table)
journeys <- rbind(
data.table(line = 1, origin = "A", destination = "C"),
data.table(line = 2, origin = "B", destination = "E"),
data.table(line = 2, origin = "D", destination = "E")
)
journeys
#> line origin destination
#> 1: 1 A C
#> 2: 2 B E
#> 3: 2 D E
stops <- rbind(
data.table(line = 1, station = c("A", "B", "C")),
data.table(line = 2, station = c("D", "B", "E"))
)
stops
#> line station
#> 1: 1 A
#> 2: 1 B
#> 3: 1 C
#> 4: 2 D
#> 5: 2 B
#> 6: 2 E
And now the goal is to find all stops that constitute a journey.
Instead of looping, it may be faster to first use joins to find the origin
and destination indices in the stops
data frame:
setkey(journeys, line)
setkey(stops, line)
stops[, idx := .I]
journeys[stops, oidx := idx, on = .(line, origin = station)]
journeys[stops, didx := idx, on = .(line, destination = station)]
journeys[, len := 1L + didx - oidx]
journeys
#> line origin destination oidx didx len
#> 1: 1 A C 1 3 3
#> 2: 2 B E 5 6 2
#> 3: 2 D E 4 6 3
Then extract the corresponding stops:
journeys[, cbind(journey = rep(.I, len), stops[sequence(len, oidx)])]
#> journey line station idx
#> 1: 1 1 A 1
#> 2: 1 1 B 2
#> 3: 1 1 C 3
#> 4: 2 2 B 5
#> 5: 2 2 E 6
#> 6: 3 2 D 4
#> 7: 3 2 B 5
#> 8: 3 2 E 6
Upvotes: 1