NineWasps

Reputation: 2253

Networkx: how to optimize building a graph

I use networkx to build a graph whose nodes are keywords extracted from texts. I tried two approaches and noticed one strange thing. The first approach is

def build_graph(all_nodes):
    G = nx.Graph()
    for nodes in all_nodes:
        for pair_of_nodes in itertools.combinations(set(nodes), 2):
            u, v = pair_of_nodes
            if G.has_edge(u, v):
                G[u][v]['weight'] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

and the second one is

def build_graph_2(all_nodes):
    nodes_weighted = Counter()
    for nodes in all_nodes:
        nodes_weighted.update(itertools.combinations(set(nodes), 2))
    return nx.Graph((x, y, {'weight': v}) for (x, y), v in nodes_weighted.items())

For testing, I generated a random sample of the form [[text_1_kw_1, text_1_kw_2, text_1_kw_3, ...], [text_2_kw_1, text_2_kw_2, ...], ...]:

texts = [[random.randint(0, 100) for _ in range(100)] for _ in range(500)]
texts = [list(map(str, text)) for text in texts]

Here the first approach gave

713 ms ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the second one

135 ms ± 9.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The second approach looks more attractive.

But a keyword is usually longer than 10 characters, so I made a new test sample:

texts = [[random.randint(10000, 10000000000) for _ in range(100)] for _ in range(500)]
texts = [list(map(str, text)) for text in texts]

However, on this sample the first approach gives

3.04 s ± 91.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the second one

3.56 s ± 332 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The advantage of the second approach over the first has shrunk. I don't fully understand why this happens, and how can I build the graph faster than with the first approach?

Upvotes: 0

Views: 398

Answers (1)

ravenspoint

Reputation: 20457

It looks to me like you are addressing nodes by their text, which can be very long. So every time you look up a node, a long text has to be compared against other long texts - horribly slow.

Suggestion: Use an integer index to address the nodes. First build a map from index (key) to the associated text (value). Then build the graph using node indices and work with the indices internally. Only look up a node's text in the map when you are outputting human-readable results.
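
A minimal sketch of that idea in Python, applied to the builder from the question (the function and variable names are mine, not part of the original post):

import itertools
from collections import Counter
import networkx as nx

def build_graph_indexed(all_texts):
    # Sketch of the integer-index suggestion; names here are illustrative.
    index = {}                       # keyword text -> integer node id
    edge_weights = Counter()
    for keywords in all_texts:
        ids = {index.setdefault(kw, len(index)) for kw in keywords}
        # Sorting the ids gives a canonical pair order, so (u, v) and (v, u)
        # are never counted as two different edges.
        edge_weights.update(itertools.combinations(sorted(ids), 2))
    G = nx.Graph((u, v, {'weight': w}) for (u, v), w in edge_weights.items())
    # Keep the id -> keyword mapping for producing human-readable output later.
    G.graph['labels'] = {i: kw for kw, i in index.items()}
    return G

When the original keywords are needed again, something like nx.relabel_nodes(G, G.graph['labels']) can translate the integer nodes back into their texts at the very end.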

Alternative: Since your node texts are unique, you can use a set instead of a map. Code it so that the node index and the position of the text in the set are the same. The code required will be simpler.
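
Python's built-in set is not positional, so a rough rendering of this alternative is a sorted list of the unique keywords, with the list position serving as the node id (again, the names below are my own):

import bisect

def build_keyword_index(all_texts):
    # Rough Python analogue of the "set" alternative: a sorted list of the
    # unique keywords, where the position in the list is the node id.
    return sorted({kw for text in all_texts for kw in text})

def node_id(keywords, kw):
    # Binary search recovers the id without a separate keyword -> id map;
    # keywords[i] recovers the text for node i directly.
    return bisect.bisect_left(keywords, kw)

That keeps a single structure for both directions of the lookup, at the cost of a binary search per keyword during the build.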

Upvotes: 1
