finding relations between co-concurrent data

Question

I have a dataframe that looks like a graph database.

import pandas as pd
mycols=['china', 'england', 'france', 'india', 'pakistan', 'taiwan']

df=pd.DataFrame([[0, 0, 0, 3, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [0, 1, 0, 1, 0, 0],
       [3, 1, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 4],
       [0, 0, 0, 0, 4, 0]], columns=mycols)

df.index=mycols

The simplified dummy dataframe looks like this:

          china  england  france  india  pakistan  taiwan
china         0        0       0      3         0       0
england       0        0       1      1         0       0
france        0        1       0      1         0       0
india         3        1       1      0         1       0
pakistan      0        0       0      1         0       4
taiwan        0        0       0      0         4       0

Let's assume a user wants to go from china to india; there is a direct route:

df[df['china'] > 0].index.str.contains('india')
array([ True])

But no direct route to england:

df[df['china'] > 0].index.str.contains('england')
array([False])

In that case, I need to find the common country:

set(df[df.loc['china'] > 0].index.values) & set(df[df.loc['england'] > 0].index.values)
{'india'}

But there are cases where there is no common friend, and I need to find a friend of a friend to reach the destination. For example, this returns an empty set:

set(df[df.loc['china'] > 0].index.values) & set(df[df.loc['taiwan'] > 0].index.values)

1) In this case, how do I write a query that will return china - india - pakistan - taiwan?

2) Is there a better way to store this, or is an SQL-like (rows / columns) layout OK?

Gambit1614 · Accepted Answer

You can do this using NetworkX in the following way.

Load the graph

import pandas as pd
import networkx as nx
mycols=['china', 'england', 'france', 'india', 'pakistan', 'taiwan']

df=pd.DataFrame([[0, 0, 0, 3, 0, 0],
   [0, 0, 1, 1, 0, 0],
   [0, 1, 0, 1, 0, 0],
   [3, 1, 1, 0, 1, 0],
   [0, 0, 0, 1, 0, 4],
   [0, 0, 0, 0, 4, 0]], columns=mycols)

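# Load the graph from the dataframe's adjacency matrix.
# (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it.)
G = nx.from_numpy_array(df.values)

# Replace the integer node labels 0..5 with the country names
graph = nx.relabel_nodes(G, dict(enumerate(mycols)))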

Test if the graph is correctly loaded

print(graph.edges())
# The edge order may differ between versions:
#EdgeView([('pakistan', 'taiwan'), ('pakistan', 'india'), ('england', 'india'), ('england', 'france'), ('india', 'china'), ('india', 'france')])

print(graph['china'])
#AtlasView({'india': {'weight': 3}})

print(graph['england'])
#AtlasView({'india': {'weight': 1}, 'france': {'weight': 1}})
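The two checks from the question (is there a direct route, and is there a common country) can also be expressed on the graph itself. The following is a minimal sketch, not part of the original answer: it rebuilds the same graph by hand from the edge list so that it runs on its own, and the variable names are only illustrative.

```python
import networkx as nx

# Same edges and weights as the adjacency matrix in the question.
graph = nx.Graph()
graph.add_weighted_edges_from([
    ("china", "india", 3),
    ("england", "france", 1),
    ("england", "india", 1),
    ("france", "india", 1),
    ("india", "pakistan", 1),
    ("pakistan", "taiwan", 4),
])

# Direct route: is there an edge between the two countries?
direct = graph.has_edge("china", "india")
no_direct = graph.has_edge("china", "england")

# Common country: neighbours shared by both endpoints.
shared = sorted(nx.common_neighbors(graph, "china", "england"))
none_shared = sorted(nx.common_neighbors(graph, "china", "taiwan"))

print(direct, no_direct, shared, none_shared)
# True False ['india'] []
```

When `common_neighbors` comes back empty, as for china and taiwan, a path search is needed instead.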

Now suppose you need to find all paths from china to taiwan:

for path in nx.all_simple_paths(graph, source='china', target='taiwan'):
    print(path)
#Output : ['china', 'india', 'pakistan', 'taiwan']

If you want to find shortest paths from one node to another

for path in nx.all_shortest_paths(graph, source='taiwan', target='india'):
    print(path)
#Output : ['taiwan', 'pakistan', 'india']

You can find several other algorithms (shortest path, all-pairs shortest path, Dijkstra's algorithm, etc.) in the NetworkX documentation to suit your queries.
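As one example of those other algorithms, here is a small sketch of a weighted shortest path using Dijkstra's algorithm. It assumes the numbers in the matrix are costs to be minimised, which the question does not state. If they measure the strength of a relation instead, the weights would need to be transformed first. The graph is rebuilt by hand so the snippet is self-contained.

```python
import networkx as nx

# Same edges and weights as the adjacency matrix in the question.
graph = nx.Graph()
graph.add_weighted_edges_from([
    ("china", "india", 3),
    ("england", "france", 1),
    ("england", "india", 1),
    ("france", "india", 1),
    ("india", "pakistan", 1),
    ("pakistan", "taiwan", 4),
])

# Assumption: the 'weight' attribute is a cost, so lower totals are better.
path = nx.dijkstra_path(graph, "china", "taiwan", weight="weight")
cost = nx.dijkstra_path_length(graph, "china", "taiwan", weight="weight")

print(path, cost)
# ['china', 'india', 'pakistan', 'taiwan'] 8
```

In this small graph there is only one route between china and taiwan, so the weighted and unweighted results coincide. They can differ on denser graphs.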

Note: there might be a way to load the graph directly from pandas using from_pandas_dataframe, but I was not sure whether it fits this use case, as it requires source and target columns.
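For a square dataframe whose index and columns carry the same labels, as in the question, newer NetworkX releases provide `nx.from_pandas_adjacency`, which reads it as an adjacency matrix and keeps the labels as node names. This is a sketch that assumes a NetworkX version (2.1 or later, as far as I know) where this function is available; it was not part of the original answer.

```python
import pandas as pd
import networkx as nx

mycols = ['china', 'england', 'france', 'india', 'pakistan', 'taiwan']

# Index and columns must hold the same labels for from_pandas_adjacency.
df = pd.DataFrame([[0, 0, 0, 3, 0, 0],
                   [0, 0, 1, 1, 0, 0],
                   [0, 1, 0, 1, 0, 0],
                   [3, 1, 1, 0, 1, 0],
                   [0, 0, 0, 1, 0, 4],
                   [0, 0, 0, 0, 4, 0]], index=mycols, columns=mycols)

# Nonzero cells become edges; the cell value is stored as 'weight'.
graph = nx.from_pandas_adjacency(df)

print(graph.number_of_edges())
print(nx.shortest_path(graph, 'china', 'taiwan'))
```

This skips the separate relabelling step, because the node names come straight from the dataframe labels.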
