Reputation: 8025
I have a dictionary whose keys are sentence_IDs and whose values are cluster_IDs.
This is the format: {sentence_ID : cluster_ID}
Example:
my_id_dict:
{0: 71,
1: 63,
2: 66,
3: 92,
4: 49,
5: 85
...}
In total, I have over 200,000 sentence_IDs and 100 cluster_IDs.
I am trying to loop over my_id_dict to generate a list of sentence_IDs for each cluster.
Example of the output I want:
Cluster 0
[63, 71, 116, 168, 187, 231, 242, 290, 330, 343]
Cluster 1
[53, 107, 281, 292, 294, 313, 353, 392, 405, 479]
The logic is: for each cluster, create a sentence list, then loop over all 200,000+ dict values; if a value equals the current cluster index, append the corresponding sentence ID to the list. Repeat this for all 100 clusters.
This is the code that I used:
from collections import defaultdict

cluster_dict = defaultdict(list)
num_clusters = 100
for cluster in xrange(0, num_clusters):
    print "\nCluster %d" % cluster
    sentences = []
    # Scan every dict entry again for each cluster
    for i in xrange(0, len(my_id_dict.values())):
        if my_id_dict.values()[i] == cluster:
            sentences.append(my_id_dict.keys()[i])
    cluster_dict[cluster] = sentences
    print sentences[:10]
This works but is terribly slow. Is there a faster way to do this?
Upvotes: 1
Views: 536
Reputation: 60994
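You're going over every sentence once per cluster, and in Python 2 each my_id_dict.values() and my_id_dict.keys() call inside the loop also builds a new list. Just go over each sentence once, assigning it to its cluster: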
from collections import defaultdict

cluster_dict = defaultdict(list)
for sentence, cluster in my_id_dict.items():
    cluster_dict[cluster].append(sentence)
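If you want something you can run end to end, here is a minimal sketch of the same single-pass grouping. It is written for Python 3 (the question uses Python 2 syntax), and the function name group_by_cluster and the sample dictionary are made up for illustration; they are not your real data.

```python
from collections import defaultdict


def group_by_cluster(id_to_cluster):
    """Invert {sentence_id: cluster_id} into {cluster_id: [sentence_id, ...]}."""
    clusters = defaultdict(list)
    # One pass over the dictionary: each sentence is visited exactly once.
    for sentence_id, cluster_id in id_to_cluster.items():
        clusters[cluster_id].append(sentence_id)
    return clusters


# Hypothetical sample data standing in for my_id_dict.
sample = {0: 2, 1: 0, 2: 2, 3: 1, 4: 0, 5: 2}
cluster_dict = group_by_cluster(sample)

# Print the first 10 sentence IDs of each cluster, in cluster order.
for cluster_id in sorted(cluster_dict):
    print("Cluster %d" % cluster_id)
    print(sorted(cluster_dict[cluster_id])[:10])
```

The work is proportional to the number of sentences rather than sentences times clusters, so it should scale to 200,000 entries without trouble. Sorting is only there to make the printed output predictable; drop it if order does not matter.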
Upvotes: 1