Reputation: 2253
I use networkx
for building a graph which consists of keywords extracted from the texts. I tried 2 approaches and I noticed one strange thing.
The first approach is
def build_graph(all_nodes):
G = nx.Graph()
for nodes in all_nodes:
for pair_of_nodes in itertools.combinations(set(nodes), 2):
u, v = pair_of_nodes
if G.has_edge(u, v):
G[u][v]['weight'] += 1
else:
G.add_edge(u, v, weight=1)
return G
and the second one is
def build_graph_2(all_nodes):
nodes_weighted = Counter()
for nodes in all_nodes:
nodes_weighted.update(itertools.combinations(set(nodes), 2))
return nx.Graph((x, y, {'weight': v}) for (x, y), v in nodes_weighted.items())
For testing I generated a random sample like [[text_1_kw_1, text_1_kw_2, text_1_kw_3, ...], [text_2_kw_1, text_2_kw_2, ...], ...]
texts = [[random.randint(0, 100) for _ in range(100)] for _ in range(500)]
texts = [list(map(str, text)) for text in texts]
Here the first approach gave
713 ms ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the second one
135 ms ± 9.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The second approach looks more attractive.
But length of a keyword usually is more than 10 chars. So here I make a new one test sample.
texts = [[random.randint(10000, 10000000000) for _ in range(100)] for _ in range(500)]
texts = [list(map(str, text)) for text in texts]
However here we get the following result for the first approach
3.04 s ± 91.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And for the second is
3.56 s ± 332 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The superiority of the second approach over the first decreased. I don't fully understand why does it happen and how can I build a graph faster than using the first approach?
Upvotes: 0
Views: 398
Reputation: 20457
It looks to me like you are addressing nodes by a text, which could be very long. So every time you search for a node you have to compare a long text with every other long text - horribly slow.
Suggestion: Use an integer index to address the nodes. Initially build a map of indices (key) and the associated text (value). Then build the graph using node indices. Internally, work with node indices. Only look up the text for a node in the map when you are outputting human readable results.
Alternative: Since your node texts are unique, use a set instead of a map. Code it so the node index and the text index in the set are the same. The code required will be simpler.
Upvotes: 1