Python: creating undirected weighted graph from a co-occurrence matrix

Question

I am using Python 2.7 for a project that collects tweets and analyzes them. The idea is to find the most common hashtags in the collection and then build a graph in which the hashtags are nodes. Two hashtags that appear in the same tweet are joined by an edge, and the weight of that edge is the number of times they co-occur. I am trying to build a dictionary of dictionaries using defaultdict(lambda : defaultdict(int)) and then create the graph with networkx.from_dict_of_dicts.

My code for creating the co-occurrence matrix is:

from collections import defaultdict

def coocurrence (common_entities):

    com = defaultdict(lambda : defaultdict(int))

    # Build co-occurrence matrix
    for i in range(len(common_entities)-1):
        for j in range(i+1, len(common_entities)):
            w1, w2 = sorted([common_entities[i], common_entities[j]])
            if w1 != w2:
                com[w1][w2] += 1

    return com

But in order to use networkx.from_dict_of_dicts I need it to be in this format: com = {0: {1: {'weight': 1}}}

Do you have any ideas how I can solve this? Or a different way of creating a graph like this?

Patrick Maupin · Accepted Answer

First of all, I would sort the entities up front, so you're not repeatedly calling sorted inside the loop. Then I would use itertools.combinations to generate the pairs. The straightforward translation of what you need with those changes is this:

from itertools import combinations
from collections import defaultdict


def coocurrence (common_entities):

    com = defaultdict(lambda : defaultdict(lambda: {'weight':0}))

    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1

    return com

print coocurrence('abcaqwvv')
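
To make the role of the sort and of the w1 != w2 check more concrete, here is a small sketch of what combinations yields on sorted input. The tags list is invented for illustration, and the snippet is written so it should run on both Python 2.7 and Python 3.

```python
from itertools import combinations

# Made-up input with one repeated entity.
tags = ['b', 'a', 'a']

# Sorting first means every pair comes out in lexicographic order,
# so each unordered pair always maps to the same (w1, w2) key.
pairs = list(combinations(sorted(tags), 2))
print(pairs)  # [('a', 'a'), ('a', 'b'), ('a', 'b')]

# If an entity should count only once per collection (an assumption
# about the desired semantics), de-duplicate before pairing.
unique_pairs = list(combinations(sorted(set(tags)), 2))
print(unique_pairs)  # [('a', 'b')]
```

The ('a', 'a') pair is why the function still needs the w1 != w2 check, and the repeated ('a', 'b') pair shows that duplicates in the input raise the weight of an edge.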

It may be more efficient (less indexing and fewer objects created) to count the pairs in a flat dictionary first and then generate the final nested structure in a second loop. When entities repeat, the second loop runs for fewer iterations than the first, because it visits each distinct pair only once, after all the counts have been calculated. For the same reason, deferring the if statement to the second loop may save more time. As usual, run timeit on multiple variations if you care, but here is one possible example of the two-loop solution:

def coocurrence (common_entities):

    com = defaultdict(int)

    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        com[w1, w2] += 1

    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result

print coocurrence('abcaqwvv')
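
To tie this back to the original goal, the nested result can then be handed to networkx. What follows is only a minimal sketch under a few assumptions: cooccurrence_weights and build_graph are names made up for this example, the hashtag list is invented, and networkx is assumed to be installed. It returns plain dicts rather than a defaultdict, which is a design choice and not something networkx requires.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(common_entities):
    """Two-loop counting as in the answer, returning plain nested dicts."""
    counts = defaultdict(int)
    for w1, w2 in combinations(sorted(common_entities), 2):
        counts[w1, w2] += 1

    result = {}
    for (w1, w2), count in counts.items():
        if w1 != w2:  # skip pairs of identical entities
            result.setdefault(w1, {})[w2] = {'weight': count}
    return result


def build_graph(nested):
    """Turn {u: {v: {'weight': n}}} into an undirected networkx Graph."""
    # networkx is a third-party package; it is imported here so the
    # counting code above stays usable without it.
    import networkx as nx
    return nx.from_dict_of_dicts(nested)


# Hypothetical hashtags, purely for illustration.
weights = cooccurrence_weights(['python', 'data', 'twitter'])
```

With networkx available, graph = build_graph(weights) should give an undirected graph in which graph['data']['python']['weight'] holds the co-occurrence count for that pair.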
