jxn

Reputation: 8025

How to speed up iterating through a large dictionary

I have a dictionary mapping sentence_ID keys to cluster_ID values.

This is the format: {sentence_ID : cluster_ID}

Example:

my_id_dict:
    {0: 71, 
    1: 63, 
    2: 66, 
    3: 92, 
    4: 49, 
    5: 85
      .
      .}

In total, I have over 200,000 sentence_IDs and 100 cluster_IDs.

I am trying to loop over my_id_dict to generate a list of sentence_ids for each cluster.

Example output I want:

Cluster 0
[63, 71, 116, 168, 187, 231, 242, 290, 330, 343]

Cluster 1
[53, 107, 281, 292, 294, 313, 353, 392, 405, 479]

This is the code that I used:

The logic: for each cluster, create a sentence list; then, for each of the 200,000+ dict values, if the value equals the current cluster index, append the corresponding sentence ID to the list.

Continue for 100 times.

    from collections import defaultdict

    cluster_dict = defaultdict(list)
    num_clusters = 100

    for cluster in xrange(num_clusters):
        print "\nCluster %d" % cluster

        sentences = []
        # Rebuilds the .values() and .keys() lists on every lookup
        for i in xrange(len(my_id_dict.values())):
            if my_id_dict.values()[i] == cluster:
                sentences.append(my_id_dict.keys()[i])

        cluster_dict[cluster] = sentences
        print sentences[:10]

This works, but it is terribly slow. Is there a faster way I can do this?

Upvotes: 1

Views: 536

Answers (1)

Patrick Haugh

Reputation: 60994

You're going over every sentence for each cluster. Just go over each sentence once, assigning it to a cluster:

from collections import defaultdict

cluster_dict = defaultdict(list)
for sentence, cluster in my_id_dict.items():
    cluster_dict[cluster].append(sentence)
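As a self-contained illustration, here is the same single-pass inversion on a small made-up dictionary (the sample values below are hypothetical, not the asker's data):

```python
from collections import defaultdict

# Hypothetical sample data: sentence_ID -> cluster_ID
my_id_dict = {0: 71, 1: 63, 2: 66, 3: 92, 4: 49, 5: 71}

# One pass over the dict: O(N) instead of O(N * num_clusters)
cluster_dict = defaultdict(list)
for sentence, cluster in my_id_dict.items():
    cluster_dict[cluster].append(sentence)

print(cluster_dict[71])  # sentence IDs assigned to cluster 71
```

Each sentence is visited exactly once, so the runtime no longer multiplies by the number of clusters.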

Upvotes: 1
