Reputation: 117
I am using Python 2.7 for a project that collects tweets and analyzes them. The idea is to find the most common hashtags in a collection of tweets and then build a graph in which the hashtags are nodes. If two hashtags appear in the same tweet, there is an edge between them, and the weight of that edge is the number of times they co-occur. I am trying to build a dictionary of dictionaries using defaultdict(lambda : defaultdict(int)) and then create the graph using networkx.from_dict_of_dicts.
My code for creating the co-occurrence matrix is:
from collections import defaultdict

def coocurrence(common_entities):
    com = defaultdict(lambda: defaultdict(int))
    # Build co-occurrence matrix
    for i in range(len(common_entities) - 1):
        for j in range(i + 1, len(common_entities)):
            # Sort each pair so (a, b) and (b, a) share one entry
            w1, w2 = sorted([common_entities[i], common_entities[j]])
            if w1 != w2:
                com[w1][w2] += 1
    return com
But in order to use networkx.from_dict_of_dicts, I need it to be in this format: com = {0: {1: {'weight': 1}}}
Do you have any ideas how I can solve this? Or a different way of creating a graph like this?
Upvotes: 1
Views: 1972
Reputation: 157
The following code works. It accepts several collections of entities and records each edge in both directions:
from collections import defaultdict
from itertools import combinations

def coocurrence(*inputs):
    com = defaultdict(int)
    for named_entities in inputs:
        # Build co-occurrence matrix
        for w1, w2 in combinations(sorted(named_entities), 2):
            com[w1, w2] += 1
            com[w2, w1] += 1  # Including both directions
    # Reshape the pair counts into {node: {neighbor: {'weight': n}}}
    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result
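As a variation on the code above, here is a minimal sketch that takes one list of hashtags per tweet and counts each pair at most once per tweet. The function name cooccurrence_per_tweet and the sample tweets are hypothetical, and dropping repeated hashtags within a tweet is an assumption about what "co-occurrence number" should mean here. It stores each pair in one direction only, and it avoids print so that it runs on both Python 2.7 and Python 3.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_per_tweet(tweets):
    """Count, for each hashtag pair, how many tweets contain both."""
    counts = defaultdict(int)
    for hashtags in tweets:
        # set() drops repeats inside one tweet (an assumption, see above);
        # sorted() gives every pair a fixed order.
        for w1, w2 in combinations(sorted(set(hashtags)), 2):
            counts[w1, w2] += 1

    # Second pass: reshape into {node: {neighbor: {'weight': n}}}
    result = defaultdict(dict)
    for (w1, w2), n in counts.items():
        result[w1][w2] = {'weight': n}
    return result


# Hypothetical sample data: one list of hashtags per tweet
tweets = [
    ['python', 'networkx', 'python'],  # the repeated tag counts once
    ['python', 'networkx', 'graphs'],
]
graph_dict = cooccurrence_per_tweet(tweets)
```

Here graph_dict['networkx']['python'] is {'weight': 2}, because both tweets contain that pair. The dictionary has the {'weight': n} shape the question asks for, so it should be usable with networkx.from_dict_of_dicts.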
Upvotes: 0
Reputation: 8137
First of all, I would sort the entities once up front, so you're not repeatedly calling sorted inside the loop. Then I would use itertools.combinations to generate the pairs. The straightforward translation of what you need with those changes is this:
from itertools import combinations
from collections import defaultdict

def coocurrence(common_entities):
    com = defaultdict(lambda: defaultdict(lambda: {'weight': 0}))
    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1
    return com

print coocurrence('abcaqwvv')
It may be more efficient (less indexing and fewer objects created) to build something else first and then generate your final answer in a second loop. The second loop won't run for as many iterations as the first, because all the counts have already been calculated. For the same reason, deferring the if statement to the second loop could save more time. As usual, run timeit on multiple variations if you care, but here is one possible example of the two-loop solution:
def coocurrence(common_entities):
    com = defaultdict(int)
    # Build co-occurrence matrix
    for w1, w2 in combinations(sorted(common_entities), 2):
        com[w1, w2] += 1
    # Second loop: reshape the counts and drop self-pairs
    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result

print coocurrence('abcaqwvv')
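To compare the two variations with timeit, something like the following rough sketch could be used. The names one_loop, two_loops, and as_plain, the sample input, and the repetition count are all hypothetical choices for illustration. Timings will depend on the machine and on the real data, so none are shown here.

```python
import timeit
from collections import defaultdict
from itertools import combinations


def one_loop(entities):
    # Single pass: nested defaultdicts hold the weights directly
    com = defaultdict(lambda: defaultdict(lambda: {'weight': 0}))
    for w1, w2 in combinations(sorted(entities), 2):
        if w1 != w2:
            com[w1][w2]['weight'] += 1
    return com


def two_loops(entities):
    # First pass: flat counts keyed by (w1, w2) tuples
    com = defaultdict(int)
    for w1, w2 in combinations(sorted(entities), 2):
        com[w1, w2] += 1
    # Second pass: reshape and drop self-pairs
    result = defaultdict(dict)
    for (w1, w2), count in com.items():
        if w1 != w2:
            result[w1][w2] = {'weight': count}
    return result


def as_plain(d):
    """Convert the outer and inner mappings to plain dicts for comparison."""
    return {k: dict(v) for k, v in d.items()}


sample = 'abcaqwvv' * 10  # hypothetical input; substitute real hashtag lists

# Check that both variations agree before timing them
assert as_plain(one_loop(sample)) == as_plain(two_loops(sample))

for fn in (one_loop, two_loops):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print('%s: %.4f s' % (fn.__name__, seconds))
```

Checking that both functions return the same result before timing them guards against comparing the speed of two functions that do different work.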
Upvotes: 2