Ahmad Senousi
Ahmad Senousi

Reputation: 633

Pandas: Shortest path length between large pairs of nodes

I have a data frame contain orgin_nodes and Distination_nodes as follows: enter image description here

I need to compute short_path_length between those nodes using networkx library by applying the next function:

def short_path_length (node1,node2):
    return nx.shortest_path_length(G, node1, nod2,weight='length')

df['short_path_length']=np.vectorize(short_length_nodes)(df['Orgin_nodes'],df['Destination_nodes'])

Where G is the network graph derived from osmnx library: I applied this code to sample of Dataframes, the resulting as follow:

enter image description here

When I applied it for orginal dataframe with around 3000000 rows it take more time?

Is there a way to make the running Faster?

update1:

I followed @gboeing answer and I converted the networkx graph to igraph as follows (https://github.com/gboeing/osmnx-examples/blob/master/notebooks/18-osmnx-to-igraph.ipynb):

ox.config(use_cache=True, log_console=True)
weight = 'length'
G_nx = nx.relabel.convert_node_labels_to_integers(G)
# convert networkx graph to igraph
G_ig = ig.Graph(directed=True)
G_ig.add_vertices(list(G_nx.nodes()))
G_ig.add_edges(list(G_nx.edges()))
G_ig.vs['osmid'] = list(nx.get_node_attributes(G_nx, 'osmid').values())
G_ig.es[weight] = list(nx.get_edge_attributes(G_nx, weight).values())



def short_path_length(node1,node2):
        return G_ig.shortest_paths(source=node1,target=node2, weights=weight)[0][0]


df['short_path_length'] = df.apply(short_path_length(df['Orgin_nodes'],df['Destination_nodes']), axis=1)

I obtained this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>()

<timed exec> in short_path_length(node1, node2)

ValueError: vertex IDs must be positive, got: -1 

the cause of this error is that the nodes number in the df['Orgin_nodes'],df['Destination_nodes'] didn't matches with G_ig vertices names. What should I do to solve it?

update2

I solved the above problem By creating datframe contain G_nx.nodes and its corresponding OSMid values and replaced the Orgin_nodes and Destination_nodes by the G_nx.nodes as follows:

df_indices_osmid_Orgin=pd.DataFrame.from_dict({'Orgin_nodes':list(nx.get_node_attributes(G_nx, 'osmid').values()),'Indecise_Nodes_Orgin':list(G_nx.nodes())})
df=pd.merge(df,df_indices_osmid_Orgin,how='inner',on='Orgin_nodes')
df_indices_osmid_Dest=pd.DataFrame.from_dict({'Destination_nodes':list(nx.get_node_attributes(G_nx, 'osmid').values()),'Indecise_Nodes_Dest':list(G_nx.nodes())})
df=pd.merge(df,df_indices_osmid_Dest,how='inner',on='Destination_nodes')

and apply the following function sample of df to measure the shortest distance:

sampl_df=df.head()
def short_path_length(row):
    return G_ig.shortest_paths(source=row['Indecise_Nodes_Orgin'], target=row['Indecise_Nodes_Dest'], weights=weight)[0][0]
sampl_df['short_path_length_1'] = sampl_df.apply(short_path_length, axis=1)

Although it running without error, it took longer time compared to the previous trial:

sampl_df=df.head()
%%time
    def short_path_length(row):
        return G_ig.shortest_paths(source=row['Indecise_Nodes_Orgin'], target=row['Indecise_Nodes_Dest'], weights=weight)[0][0]
sampl_df['short_path_length_1'] = sampl_df.apply(short_path_length, axis=1)

Wall time: 2.89 s

2.88 s ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%time
def short_path_length(row):
    return nx.shortest_path_length(G, row['Orgin_nodes'], row['Destination_nodes'], weight='length')
sampl_df['short_path_length_2'] = sampl_df.apply(short_path_length, axis=1)

Wall time: 1.24 s

1.2 s ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%time
def short_path_length (node1,node2):
     return nx.shortest_path_length(G, node1, node2,weight='length')

sampl_df['short_path_length_intr3']=np.vectorize(short_path_length)(sampl_df['Orgin_nodes'],sampl_df['Destination_nodes'])

Wall time: 1.2 s

1.21 s ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it can be noted that the third is the best or this is not the scale for identifying which of them running faster.

Upvotes: 4

Views: 1943

Answers (1)

gboeing
gboeing

Reputation: 6412

This is an inherently non-vectorizable problem as you are passing in node labels and computing shortest paths between them algorithmically with a graph object. You might get a minor speed-up by simplifying your code:

def short_path_length(row):
    return nx.shortest_path_length(G, row['Orgin_nodes'], row['Destination_nodes'], weight='length')
df['short_path_length'] = df.apply(short_path_length, axis=1)

For greater speed-up, export your OSMnx graph to igraph to compute shortest paths fast in C, as demonstrated in notebook 18 in the OSMnx examples.

Upvotes: 3

Related Questions