Igor Andri

Reputation: 189

How to increase speed of Python computations in large dataset?

I have a large dataset containing the tags attached to photos on a real website (500,000 records; each record contains at least one tag).

Example:

tag1

tag1 tag2 tag3

tag1 tag12 tag99

and so on, 500000 times

I am trying to compute a weight for each tag based on the number of occurrences of that tag in the dataset. For 500 rows the code works fine (0.1 s), but for the whole dataset it takes hours (more than 8), even under PyPy.

I assume I am doing something wrong and using Python inefficiently. This is the code that computes the weights:

from math import log

tag_set = []
tag_w = []
for photo in data:
    for tag in photo:
        if tag not in tag_set:
            tag_set.append(tag)
            # re-counts the tag across the whole dataset for every new tag
            tag_w.append(log(len(data)) - log(sum(x.count(tag) for x in data)))

How can I speed it up?

Thanks!

Upvotes: 2

Views: 443

Answers (3)

user2357112

Reputation: 281401

x.count(tag) for x in data

This part loops over all your tags in all your data. You do this once for every tag. That's a lot of unnecessary looping. Count the tags once, with Counter or defaultdict(int). If this is still slow with Counter, defaultdict(int) is probably faster, or maybe even just a regular dict. I'll use a Counter:

import collections
import itertools
tag_counts = collections.Counter(itertools.chain.from_iterable(data))
tag_to_weight_dict = {tag: weight_function(count)
                      for tag, count in tag_counts.items()}

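Here weight_function is a placeholder for whatever per-tag weight you want. A minimal sketch, assuming you want the same weight as the question's original formula (and that data is the same list of per-photo tag lists):

from math import log

def weight_function(count):
    # hypothetical helper: reproduces the question's formula log(N) - log(count),
    # where N = len(data) is the total number of photos
    return log(len(data)) - log(count)
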
Upvotes: 2

user2042696

Reputation:

If I understand correctly, you have whitespace-separated tags, one photo per row of your file, and you want to count globally how often each tag occurs and then compute a weight for each tag based on your global tally.

As others have pointed out, the last line of your code contains a superfluous counting operation.

As another response mentioned, you could start by counting with collections.Counter. If the number of keys is enormous you can probably also speed things up by working with numpy.array:

import collections
import numpy
from math import log

data_counted = collections.Counter([tag for photo in data for tag in photo])
data_vec = numpy.array([data_counted[k] for k in data_counted.keys()])
weights = numpy.subtract(log(len(data)), numpy.log(data_vec))

The above leaves you with a plain array of weights rather than a tag-to-weight mapping, but the weights come out in the same order as data_counted.keys(), so you can zip them back together (see the sketch below). Or skip the Counter step altogether and load everything straight into a numpy.array.
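A small sketch of that re-association step, assuming the data_counted and weights objects built above:

# keys() and data_vec were built in the same order, so zip re-attaches each weight to its tag
tag_to_weight = dict(zip(data_counted.keys(), weights))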

Upvotes: 0

jez

Reputation: 15349

This would be faster, because it doesn't repeat the counting operation for each tag:

from math import log
tag_counts = {}
tag_weights = {}

for photo in data:
    for tag in photo:
        tag_counts[tag] = tag_counts.get(tag, 0) + 1

for tag, count in tag_counts.items():
    tag_weights[tag] = log(len(data)) - log(count)
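
As a quick sanity check on a tiny hypothetical dataset (the three example rows from the question), running the two loops above would give:

data = [["tag1"], ["tag1", "tag2", "tag3"], ["tag1", "tag12", "tag99"]]
# after the loops: tag_counts == {"tag1": 3, "tag2": 1, "tag3": 1, "tag12": 1, "tag99": 1}
# tag_weights["tag1"] == log(3) - log(3) == 0.0
# tag_weights["tag2"] == log(3) - log(1) ≈ 1.0986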

Upvotes: 0
