Reputation: 189
I have a large dataset of tags corresponding to photos on a real website (500,000 records, each containing at least one tag).
Example:
tag1
tag1 tag2 tag3
tag1 tag12 tag99
and so on, 500,000 times
I am trying to compute weights for the tags based on the number of occurrences of each tag in the dataset. For 500 rows the code runs quickly (0.1 s), but for the whole dataset it takes hours (more than 8), even under PyPy.
I assume I am doing something wrong and using Python inefficiently. This is the code for computing the weights:
for i, photo in enumerate(data):
    for j, tag in enumerate(photo):
        if (tag not in tag_set):
            tag_set.append(tag)
            tag_w.append(log(len(data)) - log(sum(x.count(tag) for x in data)))
How can I speed it up?
Thanks!
Upvotes: 2
Views: 443
Reputation: 281401
x.count(tag) for x in data
This part loops over all your tags in all your data. You do this once for every tag. That's a lot of unnecessary looping. Count the tags once, with Counter or defaultdict(int). If this is still slow with Counter, defaultdict(int) is probably faster, or maybe even just a regular dict. I'll use a Counter:
import collections
import itertools
from math import log

tag_counts = collections.Counter(itertools.chain.from_iterable(data))

# Same weighting as in the question: log(len(data)) - log(count)
def weight_function(count):
    return log(len(data)) - log(count)

tag_to_weight_dict = {tag: weight_function(count)
                      for tag, count in tag_counts.items()}
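If Counter turns out to be too slow, a minimal sketch of the defaultdict(int) variant mentioned above (assuming data and the weight formula from the question) could look like this:

import collections
from math import log

tag_counts = collections.defaultdict(int)
for photo in data:
    for tag in photo:
        tag_counts[tag] += 1  # plain counting, no per-tag scans of data

tag_to_weight_dict = {tag: log(len(data)) - log(count)
                      for tag, count in tag_counts.items()}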
Upvotes: 2
Reputation:
If I understand correctly, you have comma-separated values, one per row of your file, and you want to count globally how often each tag occurs and then compute a weight for each tag based on your global tally.
As others have pointed out, you had a superfluous operation on the last line of your code.
As another response mentioned, you could start by counting with collections.Counter.
If the number of keys is enormous, you can probably also speed things up by working with numpy.array:
import collections
import numpy

data_counted = collections.Counter(tag for photo in data for tag in photo)
data_vec = numpy.array([data_counted[k] for k in data_counted.keys()])
weights = numpy.subtract(numpy.log(len(data)), numpy.log(data_vec))
The above doesn't really preserve the ordering of your keys in data_counted, but there is probably a way to use a hashing function.
Or skip the Counter step altogether and load everything straight into a numpy.array.
Upvotes: 0
Reputation: 15349
This would be faster, because it doesn't repeat the counting operation for each tag:
from math import log

tag_counts = {}
tag_weights = {}

for photo in data:
    for tag in photo:
        tag_counts[tag] = tag_counts.get(tag, 0) + 1

for tag, count in tag_counts.items():
    tag_weights[tag] = log(len(data)) - log(count)
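If you then need the weights per photo, similar to tag_w in the question, a small (assumed) usage example:

# One list of weights per photo, in the same order as data.
photo_weights = [[tag_weights[tag] for tag in photo] for photo in data]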
Upvotes: 0