How to optimize this Python code? I need to improve its runtime

Question

I want to optimize this filter function. It searches two lists, one of categories and one of tags, which is why it takes a long time to run.

def get_percentage(l1, l2, sim_score):
    diff = intersection(l1, l2)
    size = len(l1)
    if size != 0:
        perc = (diff/size)
        if perc >= sim_score:
            return True
    else:
        return False

def intersection(lst1, lst2):
    return len(list(set(lst1) & set(lst2)))

def filter_entities(country, city, category, entities, entityId):
    valid_entities = []
    tags = get_tags(entities, entityId)
    for index, i in entities.iterrows():
        if i["country"] == country and i["city"] == city:
            for j in i.categories:
                if j == category:
                    if(get_percentage(i["tags"], tags, 0.80)):
                        valid_entities.append(i.entity_id)

    return valid_entities

Engineero · Accepted Answer

You have a couple of unnecessary for loops and if checks in there that you can remove, and you should definitely take advantage of df.loc for selecting elements from your dataframe (assuming entities is a Pandas dataframe):

def get_percentage(l1, l2, sim_score):
    if len(l1) == 0:
        return False  # shortcut this default case
    else:
        diff = intersection(l1, l2)
        perc = (diff / len(l1))
        return perc >= sim_score  # rather than handling each case separately

def intersection(lst1, lst2):
    return len(set(lst1).intersection(lst2))  # almost twice as fast this way on my machine

def filter_entities(country, city, category, entities, entityId):
    valid_entities = []
    tags = get_tags(entities, entityId)
    # Just grab the desired elements directly, no loops
    entity = entities.loc[(entities.country == country) &
                          (entities.city == city)]
    if category in entity.categories and get_percentage(entity.tags, tags, 0.8):
        valid_entities.append(entity.entity_id)
    return valid_entities

It's difficult to say for sure that this will help because we can't really run the code you provided, but this should remove some inefficiencies and take advantage of some of the optimizations available in Pandas.
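The "almost twice as fast" comment on intersection is a measurement from one machine, so it is worth checking on your own data. The following is a minimal sketch of how that could be done with timeit; the function names, the time_both helper and the sample tag lists are made up for illustration, and the size of the speedup will depend on list length and Python version.

```python
import timeit


def intersection_original(lst1, lst2):
    # Version from the question: builds two sets, then copies the result to a list.
    return len(list(set(lst1) & set(lst2)))


def intersection_single_set(lst1, lst2):
    # Version from the answer: builds one set; intersection() accepts any iterable.
    return len(set(lst1).intersection(lst2))


def time_both(lst1, lst2, number=1000):
    """Return the seconds each variant needs for `number` calls (hypothetical helper)."""
    return {
        "original": timeit.timeit(
            lambda: intersection_original(lst1, lst2), number=number
        ),
        "single_set": timeit.timeit(
            lambda: intersection_single_set(lst1, lst2), number=number
        ),
    }


# Made-up sample data; real tag lists may be shorter or longer.
tags_a = [f"tag{i}" for i in range(50)]
tags_b = [f"tag{i}" for i in range(25, 75)]

timings = time_both(tags_a, tags_b)
print(timings)
```

Both variants return the same count, so the choice between them is purely about speed.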

Depending on your data structure (i.e. if more than one row in entity matches the country and city), you may need to replace the last three lines of filter_entities with something like this:

for ent in entity.itertuples(index=False):  # iterate over rows; iterating the DataFrame itself yields column names
    if category in ent.categories and get_percentage(ent.tags, tags, 0.8):
        valid_entities.append(ent.entity_id)
return valid_entities
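Since the question's code can't be run as posted, here is a self-contained sketch of the multi-row version on a toy dataframe. Everything about the data is an assumption: the column layout (categories and tags stored as Python lists per row), the sample rows, and in particular get_tags, which the question never shows. Here get_tags is a hypothetical stand-in that returns the tag list of the row with the given entity_id.

```python
import pandas as pd


def get_percentage(l1, l2, sim_score):
    # Share of l1's items that also appear in l2, compared against the threshold.
    if len(l1) == 0:
        return False
    return len(set(l1).intersection(l2)) / len(l1) >= sim_score


def get_tags(entities, entity_id):
    # Hypothetical stand-in: the question does not show get_tags.
    # Assumed to return the tag list of the row with this entity_id.
    return entities.loc[entities.entity_id == entity_id, "tags"].iloc[0]


def filter_entities(country, city, category, entities, entity_id, sim_score=0.8):
    tags = get_tags(entities, entity_id)
    # Narrow the frame down with a vectorized boolean mask first.
    subset = entities.loc[(entities.country == country) & (entities.city == city)]
    # Only the remaining rows are checked in Python; itertuples yields one row each.
    return [
        row.entity_id
        for row in subset.itertuples(index=False)
        if category in row.categories and get_percentage(row.tags, tags, sim_score)
    ]


# Made-up sample data for illustration only.
entities = pd.DataFrame({
    "entity_id": [1, 2, 3, 4, 5],
    "country": ["US", "US", "US", "FR", "US"],
    "city": ["NYC", "NYC", "NYC", "Paris", "NYC"],
    "categories": [["food", "bar"], ["food"], ["museum"], ["food"], ["food"]],
    "tags": [
        ["a", "b", "c", "d", "e"],
        ["a", "b", "c", "d", "e"],
        ["a", "b"],
        ["a", "b", "c", "d", "e"],
        ["a", "x", "y"],
    ],
})

result = filter_entities("US", "NYC", "food", entities, entity_id=1)
print(result)
```

The mask removes rows from other countries and cities before any per-row Python work happens, which is where most of the saving over iterrows on the full frame would come from.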
