Maria Isabel Lopez

Reputation: 23

Is there any way I can make this code faster?

I'm writing a search engine in Python for a list of recipes, and it's supposed to run in at most 0.1 seconds per search. I've been struggling to reach that speed with my code; I'm getting about 0.4 seconds on average. I wanted to know if you have any ideas on how to make this code faster. I've tried many things, but I know the loop below is what's slowing it down. Ideally I'd like to improve it with mostly plain Python and without adding many extra modules.

I've already made the other parts of the code fast (about 0.005 seconds on average), but this part gets pretty slow with a large number of recipes.

from string import digits, punctuation

def countTokens(token):
    token = str(token).lower()

    # make digits and punctuation into whitespace
    tokens = token.translate(token.maketrans(digits + punctuation,
            " " * len(digits + punctuation)))

    return tokens.split(" ")

def normalOrder(recipes, queries):
    for r in recipes:
        k = r.keys()  
        parts, scores = [[],[],[],[]], 0
        parts[0] = countTokens(r["title"])
        parts[1] = countTokens(r["categories"]) if "categories" in k else []
        parts[2] = countTokens(r["ingredients"]) if "ingredients" in k else []
        parts[3] = countTokens(r["directions"]) if "directions" in k else []
        for q in queries:
            scores += 8 * parts[0].count(q) + 4 * parts[1].count(q) + 2 * parts[2].count(q) + 1 * parts[3].count(q)

        r["score"] = scores + (r["rating"] if "rating" in k else 0)
    return recipes
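
Here is roughly how I'm timing each search (the recipe and queries below are just placeholders, my real list is much bigger):

import time

recipes = [{"title": "Honey Lemon Chicken", "rating": 4,
            "categories": "Dinner", "ingredients": "honey, lemon, chicken",
            "directions": "Roast the chicken with honey and lemon."}]
queries = ["honey", "lemon"]

start = time.perf_counter()
normalOrder(recipes, queries)
print(time.perf_counter() - start, "seconds per search")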

For a little bit of context: I need to sum the number of occurrences of the queries in the four descriptors above, but only if the recipe has them; that's why I have the ifs.

Upvotes: 2

Views: 183

Answers (2)

Richard E

Reputation: 3412

I noticed a few points:

  • Each time you call countTokens, you generate the same translation table again (the maketrans call). I don't think this gets optimized away, so you're probably losing performance there (see the timing sketch after this list).
  • tokens.split(" ") creates a list of all words in the string, which is rather expensive, e.g., when the string is 100,000 words. You don't need that.
  • Overall, it looks like you're simply trying to count how often a word is contained in a string. Using str.count(), you could count the occurrences with much less overhead.
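
For a rough sense of the first point, a minimal timeit sketch like this (numbers will of course depend on your machine and data) shows how much of each call is just rebuilding the table:

from string import digits, punctuation
from timeit import timeit

chars = digits + punctuation

# rebuilding the translation table on every call
print(timeit(lambda: str.maketrans(chars, " " * len(chars)), number=10_000))

# building it once and only translating
table = str.maketrans(chars, " " * len(chars))
print(timeit(lambda: "Roast the chicken; serve with 2 lemons.".translate(table), number=10_000))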

If you apply that, you don't need the countTokens function anymore, and with a little bit of refactoring end up with this:

def normalOrder(recipes, queries):
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        for query in queries:
            recipe["score"] += (
                8 * recipe["title"].lower().count(query)
                + 4 * recipe["categories"].lower().count(query)
                + 2 * recipe["ingredients"].lower().count(query)
                + 1 * recipe["directions"].lower().count(query)
            )

    return recipes

Does this work for you -- and is it fast enough?

Edit: In your original code, you wrapped access to recipe["title"] and the other fields in a str() call. I assume they are already strings? If they are not, you need to add that here as well.
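
For example, if those fields can actually be something other than strings, the scoring lines would become something like:

recipe["score"] += (
    8 * str(recipe["title"]).lower().count(query)
    + 4 * str(recipe["categories"]).lower().count(query)
    + 2 * str(recipe["ingredients"]).lower().count(query)
    + 1 * str(recipe["directions"]).lower().count(query)
)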


Edit2: You stated in the comments that punctuation is a problem. As I said there, I don't think you need to worry about that: the count calls only treat punctuation specially when the query word itself contains some, and in that case they only count the occurrences where the surrounding punctuation in the text matches what was queried. Take a look at these examples:

>>> "Some text, that...".count("text")
1
>>> "Some text, that...".count("text.")
0
>>> "Some text, that...".count("text,")
1

If you want this to behave differently, you can do something like you did in your original question: create a translation table and apply it. Keep in mind that applying the translation to the recipe texts (as in your question) does not make much sense, because then any query word that contains punctuation simply never matches; the same effect can be had much more cheaply by just ignoring query words that contain punctuation (see the sketch after the next code block). What you probably want is to apply the translation to the query term instead, so that if someone enters "potato,", you still find all occurrences of "potato". That would look like this:

from string import digits, punctuation

def normalOrder(recipes, queries):
    # build the translation table once, outside the loops
    translation_table = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        for query in queries:
            replaced_query = query.translate(translation_table)
            recipe["score"] += (
                8 * recipe["title"].lower().count(replaced_query)
                + 4 * recipe["categories"].lower().count(replaced_query)
                + 2 * recipe["ingredients"].lower().count(replaced_query)
                + 1 * recipe["directions"].lower().count(replaced_query)
            )

    return recipes
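
Alternatively, the simpler filter mentioned above (just dropping query words that contain punctuation instead of translating them) could look roughly like this:

from string import punctuation

def normalOrder(recipes, queries):
    # drop query words that contain any punctuation at all
    queries = [q for q in queries if not any(c in punctuation for c in q)]

    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        for query in queries:
            recipe["score"] += (
                8 * recipe["title"].lower().count(query)
                + 4 * recipe["categories"].lower().count(query)
                + 2 * recipe["ingredients"].lower().count(query)
                + 1 * recipe["directions"].lower().count(query)
            )

    return recipes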

Edit3: In the comments, you stated that you want a search for ["honey", "lemon"] to match "honey-lemon", but that you do not want "butter" to match "butterfingers". For this, your initial approach is probably the best solution, but keep in mind that searching for a singular form "potato" will not match the plural form ("potatoes") or any other derived form anymore.

from string import digits, punctuation

def normalOrder(recipes, queries):
    transtab = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        title_words = recipe["title"].lower().translate(transtab).split()
        category_words = recipe["categories"].lower().translate(transtab).split()
        ingredient_words = recipe["ingredients"].lower().translate(transtab).split()
        direction_words = recipe["directions"].lower().translate(transtab).split()

        for query in queries:
            recipe["score"] += (
                8 * title_words.count(query)
                + 4 * category_words.count(query)
                + 2 * ingredient_words.count(query)
                + 1 * direction_words.count(query)
            )

    return recipes
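
This is why the word-splitting approach matches "honey-lemon" but not "butterfingers": the translation turns the hyphen into a space, so the hyphenated term splits into two separate words, while "butterfingers" stays a single token that never equals "butter":

>>> from string import digits, punctuation
>>> transtab = str.maketrans(digits + punctuation, " " * len(digits + punctuation))
>>> words = "Honey-lemon butterfingers".lower().translate(transtab).split()
>>> words
['honey', 'lemon', 'butterfingers']
>>> words.count("honey"), words.count("butter")
(1, 0)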

If you invoke this function repeatedly with the same recipes, you can get more performance out of it by storing the result of .lower().translate().split() in each recipe, so you don't need to recreate those lists on every invocation (a rough sketch of that follows below).
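
A rough sketch of that caching idea (the "_words" keys are just names I made up for the cached lists):

from string import digits, punctuation

transtab = str.maketrans(digits + punctuation, " " * len(digits + punctuation))

def normalOrder(recipes, queries):
    for recipe in recipes:
        # tokenize each field only once and cache the result on the recipe
        if "_title_words" not in recipe:
            recipe["_title_words"] = recipe["title"].lower().translate(transtab).split()
            recipe["_category_words"] = recipe["categories"].lower().translate(transtab).split()
            recipe["_ingredient_words"] = recipe["ingredients"].lower().translate(transtab).split()
            recipe["_direction_words"] = recipe["directions"].lower().translate(transtab).split()

        recipe["score"] = recipe.get("rating", 0)
        for query in queries:
            recipe["score"] += (
                8 * recipe["_title_words"].count(query)
                + 4 * recipe["_category_words"].count(query)
                + 2 * recipe["_ingredient_words"].count(query)
                + 1 * recipe["_direction_words"].count(query)
            )

    return recipes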

Depending on your input data (how many queries do you have on average?), it might also make sense to go through the split() result once and sum up the count of each word. That makes the lookup for a single word much faster, and the counts can also be kept between function invocations, but they are more expensive to build:

from collections import Counter
from string import digits, punctuation

transtab = str.maketrans(digits + punctuation, " " * len(digits + punctuation))

def counterFromString(string):
    words = string.lower().translate(transtab).split()
    return Counter(words)

def normalOrder(recipes, queries):
    for recipe in recipes:
        recipe["score"] = recipe.get("rating", 0)

        title_counter = counterFromString(recipe["title"])
        category_counter = counterFromString(recipe["categories"])
        ingredient_counter = counterFromString(recipe["ingredients"])
        direction_counter = counterFromString(recipe["directions"])

        for query in queries:
            recipe["score"] += (
                8 * title_counter[query]
                + 4 * category_counter[query]
                + 2 * ingredient_counter[query]
                + 1 * direction_counter[query]
            )

    return recipes

Edit4: I've replaced the defaultdict with a Counter -- wasn't aware that class existed.
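
Note that a Counter returns 0 for words it has never seen, so the lookups above never raise a KeyError:

>>> from collections import Counter
>>> Counter(["honey", "lemon", "honey"])["honey"]
2
>>> Counter(["honey", "lemon", "honey"])["butter"]
0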

Upvotes: 5

Florian Bernard

Reputation: 2569

First, you can use get instead of the if conditions.


from string import digits, punctuation

def countTokens(token):
    if token is None:  # the recipe does not have this field
        return []
    token = str(token).lower()
    # make digits and punctuation into whitespace
    tokens = token.translate(token.maketrans(digits + punctuation,
            " " * len(digits + punctuation)))
    return tokens.split(" ")

def normalOrder(recipes, queries):
    for r in recipes:
        parts, scores = [[], [], [], []], 0
        parts[0] = countTokens(r["title"])
        parts[1] = countTokens(r.get("categories", None))
        parts[2] = countTokens(r.get("ingredients", None))
        parts[3] = countTokens(r.get("directions", None))
        for q in queries:
            scores += 8 * parts[0].count(q) + 4 * parts[1].count(q) + 2 * parts[2].count(q) + 1 * parts[3].count(q)
        r["score"] = scores + r.get("rating", 0)
    return recipes

Upvotes: 1
