Reputation: 2438
i ve got a dataframe series which contain sentences. (some are kind of long)
i ve also got 2 dictionaries which contain words as keys and ints as count.
Not all words from strings are present in both dictionaries. Some are in only one, some are in neither.
Dataframe is 124011 units long. function is taking me about 0.4 per string. which is waaaay to long.
W is just a reference value for the dictionary (weights = {}, weights[W] = {})
here is the function:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
ratios = []
for word in words:
if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
if (weights[W][word]!=0):
ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
else:
ratios.append(0)
if len(words)>0:
ratios = np.divide(ratios, float(len(words)))
ratio = np.sum(ratios)
return ratio
thx
Upvotes: 0
Views: 91
Reputation: 34205
It would be good if you had a profile of that function execution, but here are some generic ideas:
For example
weights_W = weights[W]
rel_weights_W = rel_weights[W]
.keys()
on dicts.These are equivalent:
word in weights_W.keys()
word in weights_W
For example instead of:
if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
if (weights[W][word]!=0):
you can do:
word_weight = weights_W.get(word)
if word_weight is not None:
word_rel_weight = rel_weights_W.get(word)
if word_rel_weight is not None:
if word_weight != 0: # lookup saved here
Upvotes: 1
Reputation: 15310
Let's clean it up a bit:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
That's redundant! Replace 4 statements with 2:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
Next:
ratios = []
for word in words:
if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
if (weights[W][word]!=0):
ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
else:
ratios.append(0)
I don't know what you think that code does. I hope you're not being tricky. But .keys
returns an iterable, and X in <iterable>
is WAY slower than X in <dict>
. Also, note: you don't append anything if the innermost (weights[W][word] != 0
) condition fails. That might be a bug, since you try to append 0 in another else condition. (I don't know what you're doing, so I'm just pointing it out.) And this is Python, not Perl or C or Java. So no parens required around if <test>:
Let's go with that:
ratios = []
for word in words:
if word in weights[W] and word in rel_weight[W]:
if weights[W][word] != 0:
ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
else:
ratios.append(0)
Next:
if len(words)>0:
ratios = np.divide(ratios, float(len(words)))
You're trying to prevent dividing by zero. But you can use the truthiness of a list to check this, and avoid the comparison:
if words:
ratios = np.divide(ratios, float(len(words)))
The rest is fine, but you don't need the variable.
ratio = np.sum(ratios)
return ratio
With those mods applied, your function looks like this:
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
ratios = []
for word in words:
if word in weights[W] and word in rel_weight[W]:
if weights[W][word] != 0:
ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
else:
ratios.append(0)
if words:
ratios = np.divide(ratios, float(len(words)))
ratio = np.sum(ratios)
return ratio
Looking at it at little harder, I see you're doing this:
word_counts = Counter(words)
for word in words:
append( word_counts[word] * ...)
According to me, that means if "apple" appears 6 times, you are going to append 6*... to the list, once for each word. So you'll have 6 different occurrences of 6*... in your list. Are you sure that's what you want? Or should it be for word in word_counts
to just iterate over the distinct words?
Another optimization is to remove lookups from inside your loop. You keep looking up weights[W]
and rel_weight[W]
, even though the value of W
never changes. Let's cache those values outside the loop. Also, let's cache a pointer to the ratios.append
method.
def match_share(string, W, weights, rel_weight):
words = string.split()
words_counts = Counter(words)
ratios = []
# Cache these values for speed in loop
ratios_append = ratios.append
weights_W = weights[W]
rel_W = rel_weight[W]
for word in words:
if word in weights_W and word in rel_W:
if weights_W[word] != 0:
ratios_append(words_counts[word] * rel_W[word] / weights_W[word])
else:
ratios_append(0)
if words:
ratios = np.divide(ratios, float(len(words)))
ratio = np.sum(ratios)
return ratio
Try that, see how it works. Please look at the bold note above, and the questions. There might be bugs, there might be more ways to speed up.
Upvotes: 1
Reputation: 127
I think that your time inefficiency may be coming from the fact you are using Counter instead of the dict. Some discussion here suggests that the dict class has parts written in pure c, while counter is written in python.
I suggest altering your code to using a dict and test to see if that provides a faster time
Also why is this code duplicated?:
words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
ratios = []
Upvotes: 1