pythonperformancepandasdictionarycoding-efficiency

Reputation: 2438

how to make this function more (time) efficient?

i ve got a dataframe series which contain sentences. (some are kind of long)

i ve also got 2 dictionaries which contain words as keys and ints as count.

Not all words from strings are present in both dictionaries. Some are in only one, some are in neither.

Dataframe is 124011 units long. function is taking me about 0.4 per string. which is waaaay to long.

W is just a reference value for the dictionary (weights = {}, weights[W] = {})

here is the function:

def match_share(string, W, weights, rel_weight):

    words = string.split()

    words_counts = Counter(words)

    ratios = []

    for word in words:

        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):

            if (weights[W][word]!=0):

                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])

        else:

            ratios.append(0)

    if len(words)>0:

        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)

    return ratio

thx

Upvotes: 0

Answers (3)

viraptor

Reputation: 34205

It would be good if you had a profile of that function execution, but here are some generic ideas:

You needlessly get some elements on every iteration. You can extract these before the loop

For example

weights_W = weights[W]
rel_weights_W = rel_weights[W]

You don't need to call .keys() on dicts.

These are equivalent:

word in weights_W.keys()
word in weights_W

Attempt to get values without looking them up first. This will save you one lookup.

For example instead of:

if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
        if (weights[W][word]!=0):

you can do:

word_weight = weights_W.get(word)
if word_weight is not None:
    word_rel_weight = rel_weights_W.get(word)
    if word_rel_weight is not None:
        if word_weight != 0:  # lookup saved here

Upvotes: 1

aghast

Reputation: 15310

Let's clean it up a bit:

def match_share(string, W, weights, rel_weight):

    words = string.split()

    words_counts = Counter(words)

    words = string.split()

    words_counts = Counter(words)

That's redundant! Replace 4 statements with 2:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)

    ratios = []

    for word in words:    

        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):

            if (weights[W][word]!=0):

                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])

        else:

            ratios.append(0)

I don't know what you think that code does. I hope you're not being tricky. But .keys returns an iterable, and X in <iterable> is WAY slower than X in <dict>. Also, note: you don't append anything if the innermost (weights[W][word] != 0) condition fails. That might be a bug, since you try to append 0 in another else condition. (I don't know what you're doing, so I'm just pointing it out.) And this is Python, not Perl or C or Java. So no parens required around if <test>:

Let's go with that:

    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)

    if len(words)>0:

        ratios = np.divide(ratios, float(len(words)))

You're trying to prevent dividing by zero. But you can use the truthiness of a list to check this, and avoid the comparison:

    if words:
        ratios = np.divide(ratios, float(len(words)))

The rest is fine, but you don't need the variable.

    ratio = np.sum(ratios)

    return ratio

With those mods applied, your function looks like this:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Looking at it at little harder, I see you're doing this:

word_counts = Counter(words)

for word in words:
    append(   word_counts[word] * ...)

According to me, that means if "apple" appears 6 times, you are going to append 6*... to the list, once for each word. So you'll have 6 different occurrences of 6*... in your list. Are you sure that's what you want? Or should it be for word in word_counts to just iterate over the distinct words?

Another optimization is to remove lookups from inside your loop. You keep looking up weights[W] and rel_weight[W], even though the value of W never changes. Let's cache those values outside the loop. Also, let's cache a pointer to the ratios.append method.

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    # Cache these values for speed in loop
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]

    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:    
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])

        else:
            ratios_append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Try that, see how it works. Please look at the bold note above, and the questions. There might be bugs, there might be more ways to speed up.

Upvotes: 1

kopo222

Reputation: 127

I think that your time inefficiency may be coming from the fact you are using Counter instead of the dict. Some discussion here suggests that the dict class has parts written in pure c, while counter is written in python.

I suggest altering your code to using a dict and test to see if that provides a faster time

Also why is this code duplicated?:

words = string.split()

words_counts = Counter(words)

words = string.split()

words_counts = Counter(words)

ratios = []

Upvotes: 1

how to make this function more (time) efficient?

Answers (3)

Related Questions