Andreas
Andreas

Reputation: 167

R: How could I reduce runtime of osmdata to an igraph conversion

Is it possible to reduce the run time of the following code?

My goal is to get an weighted igraph object from open street data area specified by a box boundary.

Currently im trying to use the overpass api so to offload the memory load so I dont have to keep big osm files in memory.

First I get a osm data specified by a bbox (only streets) as an xml structure

library(osmdata)
library(osmar)
install.packages("remotes")
remotes::install_github("hypertidy/scgraph")
library(scgraph)


dat <- opq(bbox = c(11.68771, 47.75233, 12.35058, 48.19743 )) %>%
  add_osm_feature(key = 'highway',value = c("trunk", "trunk_link", "primary","primary_link", "secondary", "secondary_link", "tertiary","tertiary_link", "residential", "unclassified" ))%>% 
  osmdata_xml ()

Then I convert the resulting xml object dat to an osmar object dat_osmar and finally to an igraph object:

dat_osmar <-as_osmar(xmlParse(dat))
dat_graoh <- as_igraph(dat_osmar)

How could I optimize these routines?

Maybe it is possible to separate dat (XML) object in chunks and parse it in parallell?

I go through several steps only to finally to get to a weighted non directed graph.

Currently the whole process takes 89.555 sec on my machine.

If I could cut the running time of these two stepps:

dat_osmar <-as_osmar(xmlParse(dat))
dat_graoh <- as_igraph(dat_osmar)

that would help already.

One of the approaches I tried is to use osmdata_sc() instead of osmdata_xml().

This provides an silicate object and I can convert it with:

scgraph::sc_as_igraph(dat)

to an igraph.

It is decently fast but sadly the weights are getting lost so its not a solution.

The reason for it is: if I use the conversion from osmar object to an igraph object with the function osmar::as_igraph() the weight is calculated based on the distances between two edges and added to the igraph:

    edges <- lapply(dat, function(x) {
    n <- nrow(x)
    from <- 1:(n - 1)
    to <- 2:n
    weights <- distHaversine(x[from, c("lon", "lat")], x[to, 
      c("lon", "lat")])
    cbind(from_node_id = x[from, "ref"], to_node_id = x[to, 
      "ref"], way_id = x[1, "id"], weights = weights)
  })

This is missing from scgraph::sc_as_igraph(dat)

If this could be added to silicate to igraph conversion I could skip the dat_osmar <-as_osmar(xmlParse(dat)) step and go overpass->silicate->igraph route which is much faster istead of overpass->xml->osmar->igraph.

osmdata package also provides a sf response with osmdata_sf()

so maybe the workflow overpass->sf->igraph is faster but also while using this way I would need the weights incorporated into the graph based on the distance of edges and im not good enough to do it currently and would realy appreciate any help.

Furthermore the connection between openstreetmap gps points and their IDs should not be lost while using sf and resulting igraph object. Meaning I should be able to find gps position to an ID from the resulting Igraph. A lookup table would be enough. If i go overpass->silicate->igraph or overpass->xml->osmar->igraph routes it is possible. Im not sure if it is still would be posible with overpass->sf->igraph route.

Upvotes: 5

Views: 898

Answers (2)

agila
agila

Reputation: 3482

If you want to create a graph object starting from a network of roads in R, then I would use the following procedure.

First of all, I need to install sfnetworks from the github repo (since we recently fixed some bugs and the newest version is not on CRAN)

remotes::install_github("luukvdmeer/sfnetworks", quiet = TRUE)

Then load packages

library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(tidygraph)
#> 
#> Attaching package: 'tidygraph'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(sfnetworks)
library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright

Download data from Overpass API

my_osm_data <- opq(bbox = c(11.68771, 47.75233, 12.35058, 48.19743 )) %>%
  add_osm_feature(
    key = 'highway', 
    value = c("trunk", "trunk_link", "primary","primary_link", "secondary", "secondary_link", "tertiary","tertiary_link", "residential", "unclassified")
  ) %>% 
  osmdata_sf(quiet = FALSE)
#> Issuing query to Overpass API ...
#> Rate limit: 2
#> Query complete!
#> converting OSM data to sf format

Now I extract the roads and build the sfnetwork object:

system.time({
  # extract the roads
  my_roads <- st_geometry(my_osm_data$osm_lines)
  
  # build the sfnetwork object
  my_sfn <- as_sfnetwork(my_roads, directed = FALSE, length_as_weight = TRUE)
})
#>    user  system elapsed 
#>    3.03    0.16    3.28

As you can see, after downloading the OSM data, it takes just a couple of seconds to run that procedure.

At the moment I ignore all fields in my_osm_data$osm_lines, but if you need to add some of the columns in my_osm_data$osm_lines to my_roads, then you can modify the previous code as follows: my_roads <- my_osm_data$osm_lines[, "relevant columns"].

A few details regarding the construction of the sfnetwork object: the parameter "directed = FALSE" specifies that we want to build an undirected graph (see the docs, here and here for more details), while the parameter length_as_weight = TRUE says that the length of the edges will be stored in a column called "weight" and used by igraph and tidygraph algorithms.

This is the printing of my_sfn object:

my_sfn
#> # A sfnetwork with 33179 nodes and 28439 edges
#> #
#> # CRS:  EPSG:4326 
#> #
#> # An undirected multigraph with 6312 components with spatially explicit edges
#> #
#> Registered S3 method overwritten by 'cli':
#>   method     from         
#>   print.boxx spatstat.geom
#> # Node Data:     33,179 x 1 (active)
#> # Geometry type: POINT
#> # Dimension:     XY
#> # Bounding box:  xmin: 11.6757 ymin: 47.74745 xmax: 12.39161 ymax: 48.22025
#>                     x
#>           <POINT [°]>
#> 1 (11.68861 47.90971)
#> 2 (11.68454 47.90937)
#> 3 (11.75216 48.17638)
#> 4 (11.75358 48.17438)
#> 5  (11.7528 48.17351)
#> 6 (11.74822 48.17286)
#> # ... with 33,173 more rows
#> #
#> # Edge Data:     28,439 x 4
#> # Geometry type: LINESTRING
#> # Dimension:     XY
#> # Bounding box:  xmin: 11.6757 ymin: 47.74745 xmax: 12.39161 ymax: 48.22025
#>    from    to                                                           x weight
#>   <int> <int>                                            <LINESTRING [°]>  <dbl>
#> 1     1     2 (11.68861 47.90971, 11.6878 47.90965, 11.68653 47.90954, 1~   306.
#> 2     3     4 (11.75216 48.17638, 11.75224 48.17626, 11.75272 48.17556, ~   246.
#> 3     5     6 (11.7528 48.17351, 11.75264 48.17344, 11.75227 48.17329, 1~   382.
#> # ... with 28,436 more rows

my_sfn is an igraph object by definition:

class(my_sfn)
#> [1] "sfnetwork" "tbl_graph" "igraph"

but, if you want to be more explicit, then

as.igraph(my_sfn)
#> IGRAPH 101dcdf U-W- 33179 28439 -- 
#> + attr: x (v/x), x (e/x), weight (e/n)
#> + edges from 101dcdf:
#>  [1]   1--  2   3--  4   5--  6   7--  8   9-- 10  11-- 12  13-- 14  15-- 16
#>  [9]  17-- 18  16-- 19  20-- 21  21-- 22  23-- 24  25-- 26  27-- 28  29-- 30
#> [17]  31-- 32  33-- 34  35-- 36  37-- 38  39-- 40  41-- 42  43-- 44  45-- 46
#> [25]  14-- 47  48-- 49  50-- 51  52-- 53  54-- 55  56-- 57  36-- 58  58-- 59
#> [33]  60-- 61  62-- 63  64-- 65  66-- 67  68-- 69  70-- 71  72-- 73  74-- 75
#> [41]  76-- 77  78-- 79  80-- 81  82-- 83  84-- 85  86-- 87  88-- 89  90-- 91
#> [49]  92-- 93  94-- 95  96-- 97  98-- 99 100--101 102--103 104--105 106--107
#> [57] 108--109 110--111 112--113  80--114 115--116 117--118 119--120 121--122
#> + ... omitted several edges

You can see that the edges have a weight attribute that is equal to the length of each LINESTRING geometry:

all.equal(
  target = igraph::edge_attr(as.igraph(my_sfn), "weight"), 
  current = as.numeric(st_length(my_roads))
)
#> [1] TRUE

Created on 2021-03-26 by the reprex package (v1.0.0)

If you want to read more details regarding sfnetworks, then you can check the website and the introductory vignettes. That being said, I don't understand what you mean by

connection between openstreetmap gps points and their IDs should not be lost

Can you add more details with a comment or an edit to the original question? Why do you need to OSM id? And what do you mean by OSM id? I think that I need more details to expand this answer.

EDIT

I just re-read @mrhellmann 's answer, and I noticed that I forgot to convert POLYGON data to lines. Anyway, I would suggest applying osmdata::osm_poly2line() immediately after running the code to download OSM data via Overpass API.

Upvotes: 3

mrhellmann
mrhellmann

Reputation: 5529

It seems like getting the xml data into another format is taking a long time. Instead of using xml, asking overpass to return an sf object and using that might be quicker. The sf object can then be manipulated & used by the sfnetworks package to get a network, while retaining the spatial aspects of the network. Weights can be added by functions from sfnetworks (or tidygraph) packages.

I think the code below focuses on taking care of the speed and the edge weight problems. Other problems, like finding the nearest nodes or edges, can be solved using the functions of the sf package, but are not addressed. Otherwise this becomes more than a one-off SO question.

The speed might be able to be increased, at the cost of spatial accuracy, by using st_simplify for the edges. One problem with this approach is that stnetworks places a node where each linestring meets another. The data returned often has a single roadway split into multiple linestrings. As an example, see two longer roads in yellow on the edges plot below. Probably a solvable problem, but may not be worthwhile in this case.

library(osmdata)
#library(osmar)
library(tidyverse)
library(sf)
library(ggplot2)
library(sfnetworks)
library(tidygraph)


# get data as an sf object rather than xml
## This is the slowest part of the code.
dat_sf <- opq(bbox = c(11.68771, 47.75233, 12.35058, 48.19743 )) %>%
  add_osm_feature(key = 'highway',value = c("trunk", "trunk_link", "primary","primary_link", "secondary", "secondary_link", "tertiary","tertiary_link", "residential", "unclassified" ))%>% 
  osmdata_sf()



# Only keep lines & polygons. Points take up too much memory & 
## all seem to be on lines anyway. Change polygons to LINESTRING,
## as most seem to be roundabouts or a few odd cases.
lines_sf <- dat_sf$osm_lines %>% select(osm_id) %>% st_sf()
polys_sf <- dat_sf$osm_polygons %>% select(osm_id) %>% st_sf() %>% 
  st_cast('LINESTRING')

# Combine the two above sf objects into one 
dat_sf_bound <- rbind(lines_sf, polys_sf)

# get an sfnetwork
dat_sf_net <- as_sfnetwork(dat_sf_bound)

# add edge weight as distance
dat_sf_net <- dat_sf_net %>%
  activate(edges) %>%
  mutate(weight = edge_length())

dat_sf_net should look like:

> dat_sf_net
# An sfnetwork with 33255 nodes and 28608 edges
#
# CRS:  EPSG:4326 
#
# A directed multigraph with 6391 components with spatially explicit edges
#
# Edge Data:     28,608 x 4 (active)
# Geometry type: LINESTRING
# Dimension:     XY
# Bounding box:  xmin: 11.6757 ymin: 47.74745 xmax: 12.39161 ymax: 48.22025
   from    to   weight                                                                  geometry
  <int> <int>      [m]                                                          <LINESTRING [°]>
1     1     2 306.3998 (11.68861 47.90971, 11.6878 47.90965, 11.68653 47.90954, 11.68597 47.909…
2     3     4 245.9225 (11.75216 48.17638, 11.75224 48.17626, 11.75272 48.17556, 11.7528 48.175…
3     5     6 382.2423 (11.7528 48.17351, 11.75264 48.17344, 11.75227 48.17329, 11.752 48.1732,…
4     7     8 131.1373 (11.70029 47.87861, 11.70046 47.87869, 11.70069 47.87879, 11.70138 47.87…
5     9    10 252.9170 (11.75733 48.17339, 11.75732 48.17343, 11.75726 48.17357, 11.75718 48.17…
6    11    12 131.6942 (11.75582 48.17036, 11.75551 48.1707, 11.75521 48.17106, 11.75507 48.171…
# … with 28,602 more rows
#
# Node Data:     33,255 x 1
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: 11.6757 ymin: 47.74745 xmax: 12.39161 ymax: 48.22025
             geometry
          <POINT [°]>
1 (11.68861 47.90971)
2 (11.68454 47.90937)
3 (11.75216 48.17638)
# … with 33,252 more rows

Plots of just edges, and of edges with nodes: enter image description here

A few long roads skew the coloring, but illustrate the a single road split in two.

Edit: Addressing the comment to select nearest edge(road) by latitude / longitude coordinates. Nodes(intersections / red dots above) do not have osm id numbers that I am aware of. Nodes are created by sfnetworks.

Starting with an sf object of lat/lon point as our made up gps coordinate.

# random point
gps <- sfheaders::sf_point(data.frame(x = 11.81854, y = 48.04514)) %>% st_set_crs(4326)

# nearest edge(road) to the point. dat_sf_net must have edges activated.
near_edge <- st_nearest_feature(gps, dat_sf_net %>% st_as_sf())

>near_edge
[1] 4359

> st_as_sf(dat_sf_net)[near_edge,]
Simple feature collection with 1 feature and 4 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 11.81119 ymin: 48.02841 xmax: 11.82061 ymax: 48.06845
Geodetic CRS:  WGS 84
# A tibble: 1 x 5
   from    to osm_id                                                           geometry   weight
  <int> <int> <chr>                                                    <LINESTRING [°]>      [m]
1  7590  7591 24232418 (11.81289 48.02841, 11.81213 48.03014, 11.81202 48.03062, 11.81… 4576.273

p3 <- gplot() +
      geom_sf(data = st_as_sf(dat_sf_net), color = 'black') +       
      geom_sf(data = gps, color = 'red') + 
      geom_sf(data = st_as_sf(dat_sf_net)[near_edge,], color = 'orange') + 
      coord_sf(xlim = c(11.7, 11.9), ylim = c(48, 48.1))

enter image description here

It looks like @agila commented above. Since he's an author of sfnetworks, maybe he will have some insights.

Upvotes: 2

Related Questions