SteveS

Reputation: 4040

Draw edges between nodes based on similarity using NetworkX?

Here is my toy nodes dataframe:

    import pandas as pd
    
    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'a': [55, 2123, -19.3, 9, -8], 
        'b': ['aa', 'bb', 'ad', 'kuku', 'lulu']
    })

I am building a Graph with the nodes (each row of the df is a node with id and attributes):

    import networkx as nx
    G = nx.Graph()
    
    for i, attr in df.set_index('id').iterrows():
        G.add_node(i, **attr.to_dict())

Now I want to connect these nodes based on node similarity (cosine or any other distance function). Questions:

  1. Can I compute node similarity with mixed types, applying a different distance metric to each type?
  2. If my nodes' attributes are all numeric, how can I calculate the similarity between any two nodes in my graph and draw an edge if the similarity between node 1 and node 2 is above some threshold alpha?

For question 2, assume the df above is instead:

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'a': [55, 2123, -19.3, 9, -8],
        'b': [21, -0.1, 0.003, 4, 2.1]
    })

Upvotes: 1

Views: 784

Answers (1)

SultanOrazbayev

Reputation: 16561

AFAIK, networkx does not compute similarity from node attributes, so that will have to be calculated outside networkx.

For question 1, given the mixed data types, I can recommend recordlinkage. Using this library you can implement the logic for which combination of numeric/string variables counts as 'similar'.
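
As a rough illustration of per-type metrics without recordlinkage, here is a minimal sketch that uses the standard library's difflib for the string column and a range-scaled absolute difference for the numeric column. The helper names (string_sim, numeric_sim, build_graph), the 0.5 weights and the 0.6 threshold are assumptions made up for this example, not part of any library:

```python
from difflib import SequenceMatcher
from itertools import combinations

import networkx as nx
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': ['aa', 'bb', 'ad', 'kuku', 'lulu'],
})


def string_sim(x, y):
    # difflib ratio in [0, 1]; 1.0 means identical strings
    return SequenceMatcher(None, x, y).ratio()


def numeric_sim(x, y, scale):
    # 1.0 for equal values, falling linearly to 0.0 at the full value range
    return 1.0 - abs(x - y) / scale


def build_graph(df, w_num=0.5, w_str=0.5, threshold=0.6):
    # hypothetical helper: weights and threshold are arbitrary illustration values
    G = nx.Graph()
    for i, attr in df.set_index('id').iterrows():
        G.add_node(i, **attr.to_dict())
    scale = df['a'].max() - df['a'].min()
    # visit every unordered pair of nodes once
    for u, v in combinations(G.nodes, 2):
        sim = (w_num * numeric_sim(G.nodes[u]['a'], G.nodes[v]['a'], scale)
               + w_str * string_sim(G.nodes[u]['b'], G.nodes[v]['b']))
        if sim > threshold:
            G.add_edge(u, v, weight=sim)
    return G


G = build_graph(df)
print(sorted(G.edges))
```

The pairwise loop is quadratic in the number of nodes, which is fine for a toy frame but is one reason to look at a dedicated library for larger data.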

For question 2, if the data is all numeric, then using sklearn's pairwise distances is appropriate (as of version 1.0.2, it does not support string dtype, so for string variables recordlinkage/another string processing library or a custom pipeline is needed). Something along these lines:

    import networkx as nx
    import numpy as np
    import pandas as pd
    from sklearn.metrics import pairwise_distances

    df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5],
            "a": [55, 2123, -19.3, 9, -8],
            "b": ["aa", "bb", "ad", "kuku", "lulu"],
        }
    )

    # pairwise Euclidean distances on the numeric column only
    dist_a = pairwise_distances(df[["a"]], metric="euclidean")

    # form links if distance is lower than some threshold;
    # the upper triangle (k=1) skips self-pairs and mirrored (j, i) pairs
    ix_a, ix_b = np.where(np.triu(dist_a < 70, k=1))

    # add nodes keyed by id (not by row position), then the edges
    ids = df["id"].tolist()
    G = nx.Graph()
    G.add_nodes_from(ids)
    for i, j in zip(ix_a, ix_b):
        G.add_edge(ids[i], ids[j])
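
Question 2 is phrased in terms of a similarity threshold alpha rather than a distance cutoff. A minimal sketch of that variant on the all-numeric df from the question, with cosine similarity computed directly in NumPy, could look like this. The value alpha = 0.9 is an assumption picked only for illustration:

```python
from itertools import combinations

import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

alpha = 0.9  # assumed threshold, chosen only for illustration

# cosine similarity: dot products of the row vectors scaled to unit length
# (an all-zero row would divide by zero here and needs separate handling)
X = df[['a', 'b']].to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T

ids = df['id'].tolist()
G = nx.Graph()
G.add_nodes_from(ids)
# visit each unordered pair once, which also skips self-pairs
for i, j in combinations(range(len(ids)), 2):
    if sim[i, j] > alpha:
        G.add_edge(ids[i], ids[j], weight=float(sim[i, j]))
```

Note that cosine similarity only compares direction, so rows with very different magnitudes (such as a = 55 and a = 2123) can still come out as similar.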

For handling multiple columns (and distances), one will need to add some logic for how to combine (and possibly weight/normalize) the different distances.
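
One possible way to do that, sketched here under assumptions: compute a distance matrix per column, rescale each to [0, 1], and take a weighted sum. The weights dict and the max_distance cutoff are hypothetical values chosen for the example, and the per-column distance is a plain absolute difference computed via NumPy broadcasting:

```python
import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

# hypothetical per-column weights and cutoff, picked only for illustration
weights = {'a': 0.5, 'b': 0.5}
max_distance = 0.2

combined = np.zeros((len(df), len(df)))
for col, w in weights.items():
    x = df[col].to_numpy(dtype=float)
    # absolute pairwise differences via broadcasting
    d = np.abs(x[:, None] - x[None, :])
    # rescale by the largest distance so every column lies in [0, 1]
    combined += w * d / d.max()

# keep each unordered pair once (k=1 drops the diagonal and mirrored pairs)
rows, cols = np.where(np.triu(combined < max_distance, k=1))

ids = df['id'].tolist()
G = nx.Graph()
G.add_nodes_from(ids)
for i, j in zip(rows, cols):
    G.add_edge(ids[i], ids[j], weight=float(combined[i, j]))
```

Rescaling by the maximum is sensitive to outliers (here the 2123 in column a compresses all other distances in that column), so a more robust scaling may be worth considering for real data.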

Upvotes: 1
