Reputation: 4040
Here is my toy nodes dataframe:
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': ['aa', 'bb', 'ad', 'kuku', 'lulu']
})
I am building a Graph with the nodes (each row of the df is a node with id and attributes):
import networkx as nx
G = nx.Graph()
for i, attr in df.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())
Now I want to connect these nodes using node similarity (cosine or any other distance function). Questions:
1. How can I compute similarity between nodes and connect them when the attributes are of mixed types (numeric and string)?
2. How can I do the same when all attributes are numeric?
For question 2, consider that my df above is:
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})
Upvotes: 1
Views: 784
Reputation: 16561
AFAIK, networkx does not implement calculation of similarity, so that will have to be calculated outside networkx.
For question 1, given the mixed data types, I can recommend recordlinkage. Using this library you can implement the logic for deciding which combinations of numeric/string values are considered 'similar'.
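As a rough sketch of what that could look like (the comparison methods, the gauss offset/scale, and the thresholds below are illustrative assumptions, not tuned recommendations):
import networkx as nx
import pandas as pd
import recordlinkage

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a": [55, 2123, -19.3, 9, -8],
    "b": ["aa", "bb", "ad", "kuku", "lulu"],
}).set_index("id")

# candidate pairs: all pairs of rows (fine for a small df)
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(df)

# compare numeric column 'a' and string column 'b'
compare = recordlinkage.Compare()
compare.numeric("a", "a", method="gauss", offset=10, scale=50, label="a_sim")
compare.string("b", "b", method="jarowinkler", threshold=0.8, label="b_sim")
features = compare.compute(pairs, df)

# call a pair 'similar' if the combined score clears some threshold
G = nx.Graph()
G.add_nodes_from(df.index)
for (u, v), score in features.sum(axis=1).items():
    if score >= 1.5:
        G.add_edge(u, v)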
For question 2, if the data is all numeric, then using sklearn's pairwise distances is appropriate (as of version 1.0.2 it does not support string dtypes, so for string variables recordlinkage, another string-processing library, or a custom pipeline is needed). Something along these lines:
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "a": [55, 2123, -19.3, 9, -8],
        "b": ["aa", "bb", "ad", "kuku", "lulu"],
    }
)
dist_a = pairwise_distances(df[["a"]], metric="euclidean")
# form links if distance is lower than some threshold
ix_a, ix_b = np.where(dist_a < 70)
# add edges; note that np.where also returns the diagonal (self-loops) and each
# pair in both directions, and that source/target are positional indices 0..4,
# not the 'id' values
G = nx.Graph()
for source, target in zip(ix_a, ix_b):
    G.add_edge(source, target)
For handling multiple columns (and distances), one will need to integrate some logic on how to combine (and possibly weigh/normalize) different distances.
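For example, one simple way to combine the two numeric columns of the second df is to standardize them and take a single distance over the pair of features (a sketch; the scaling choice and the 0.5 cosine-distance threshold are arbitrary assumptions and would need tuning):
import networkx as nx
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a": [55, 2123, -19.3, 9, -8],
    "b": [21, -0.1, 0.003, 4, 2.1],
})

# put both columns on a comparable scale, then compute one distance matrix
X = StandardScaler().fit_transform(df[["a", "b"]])
dist = pairwise_distances(X, metric="cosine")

# link ids whose combined distance falls below the threshold,
# skipping self-pairs and duplicate (j, i) pairs
G = nx.Graph()
G.add_nodes_from(df["id"])
ids = df["id"].to_numpy()
for i, j in zip(*np.where(dist < 0.5)):
    if i < j:
        G.add_edge(ids[i], ids[j])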
Upvotes: 1