Dustin Sun

Reputation: 5532

Smarter way to check if a string contains an element in a list - python

List top_brands contains a list of brands, such as

top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]

items is a pandas.DataFrame with the structure shown below. My task is to fill in brand_name from item_title wherever brand_name is missing:

row  | item_title             | brand_name
1    | Apple 6S               | Apple
2    | New Victoria's Secret  | missing   <-- need to fill with Victoria's Secret
3    | Used Samsung TV        | missing   <-- need to fill with Samsung
4    | Used bike              | missing   <-- nothing to do; there is no brand name in the title
...

My code is below. The problem is that it is too slow for a DataFrame containing 2 million records. Is there a way to use pandas or numpy to speed this up?

def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']

    item_title = row['item_title']

    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or item_title.endswith(brand_end) or item_title.startswith(brand_start)): 
            print(brand)
            return brand

    return 'missing'    ### end of get_brand_name


items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)

Upvotes: 5

Views: 1141

Answers (4)

sinapan

Reputation: 1000

Building a dataset with 2M data points:

import pandas as pd
import time
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', 'Samsung']
items = pd.DataFrame(
        [['Apple 6S', 'Apple'],
         ['New Victoria\'s Secret', 'missing'],
         ['Used Samsung TV', 'missing'],
         ['Used bike', 'missing']],
         columns=['item_title', 'brand_name'])
items = pd.concat([items]*500000, ignore_index=True)

Timing your original code to use as a reference for comparison:

''' Code Block 1 '''
items1 = items.copy()
t = time.time()
def get_brand_name_v1(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or \
            item_title.endswith(brand_end) or  \
            item_title.startswith(brand_start)): 
            return brand
    return 'missing'
items1['brand_name'] = items1.apply(lambda x: get_brand_name_v1(x), axis=1)
print('Code Block 1 time: {:f}'.format(time.time()-t))

# Code Block 1 time: 53.718933

A modified version of your code: working with NaN values is usually faster than string comparison against 'missing'. Also, in my experience, binding a local name to a value (e.g. `brand_name = row['brand_name']`) is slightly faster than calling `row['brand_name']` multiple times.

''' Code Block 2 '''
items2 = items.copy()
t = time.time()
items2['brand_name'] = items2['brand_name'].replace(['missing'], [None])
def get_brand_name_v2(row):
    brand_name = row['brand_name']
    if brand_name is not None: return brand_name
    item_title = row['item_title']
    for brand in top_brands:
        if brand in item_title: return brand
items2['brand_name'] = items2.apply(lambda x: get_brand_name_v2(x), axis=1)
items2['brand_name'] = items2['brand_name'].fillna('missing')
print('Code Block 2 time: {:f}'.format(time.time()-t))

# Code Block 2 time: 47.940444

Inspired by Idlehands's answer: this version does not discard the information already in the brand_name column of the original data set; it fills only the missing values. You gain speed this way but use more memory.

''' Code Block 3 '''
items3 = items.copy()
items3['brand_name'] = items3['brand_name'].replace(['missing'], [None])
t = time.time()
brands = (items3['item_title'].str.extract(
        '(?P<brand_name>{})'.format("|".join(top_brands)), expand=True))
brands['brand_name'] = brands['brand_name'].fillna('missing')
items3['brand_name'] = items3['brand_name'].fillna(brands['brand_name'])
print('Code Block 3 time: {:f}'.format(time.time()-t))

# Code Block 3 time: 3.388266

You can make these even faster if you can afford to commit to using NaN instead of 'missing' in your dataset, removing all of the operations that replace NaN with 'missing'.
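For illustration, a minimal sketch of that NaN-only variant (my own condensation of Code Block 3, assuming missing values are stored as NaN rather than the string 'missing'):

```python
import pandas as pd
import numpy as np

top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']

items = pd.DataFrame(
    [['Apple 6S', 'Apple'],
     ["New Victoria's Secret", np.nan],
     ['Used Samsung TV', np.nan],
     ['Used bike', np.nan]],
    columns=['item_title', 'brand_name'])

# Extract the first matching brand from item_title (NaN where nothing matches),
# then fill only the rows where brand_name is already NaN.
pattern = '(?P<brand_name>{})'.format('|'.join(top_brands))
extracted = items['item_title'].str.extract(pattern, expand=True)
items['brand_name'] = items['brand_name'].fillna(extracted['brand_name'])
```

With no replace/fillna round-trips to the 'missing' sentinel, the whole job stays vectorized end to end.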

Upvotes: 0

r.ook

Reputation: 13848

Try this:

pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)

Output:

              item_title         brand_name
0               Apple 6S              Apple
1  New Victoria's Secret  Victoria's Secret
2        Used Samsung TV            Samsung
3              Used Bike            missing
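One caveat worth noting (my addition, not part of the original answer): joining brand names directly into a regex breaks if any brand contains regex metacharacters. The hypothetical brand 'A+ Audio' below illustrates; escaping each brand with `re.escape` before joining avoids the problem:

```python
import re
import pandas as pd

# 'A+ Audio' is a made-up brand containing a regex metacharacter ('+')
top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'A+ Audio']

df = pd.DataFrame({'item_title': ['Apple 6S', 'Used A+ Audio amp', 'Used bike']})

# re.escape neutralizes metacharacters before the brands are joined into one pattern
pattern = '(?P<brand_name>{})'.format('|'.join(re.escape(b) for b in top_brands))
result = pd.concat(
    [df['item_title'],
     df['item_title'].str.extract(pattern, expand=True).fillna('missing')],
    axis=1)
```

For the brand list in the question this makes no difference, but it keeps the approach safe for arbitrary brand strings.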

I ran against a random sample of 2 million items on my machine:

def read_file():
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
end = time.time() - start
print(f'Took {end}s to process')

Output:

                                   item_title         brand_name
0                                    LG watch                 LG
1                                  Sony watch               Sony
2                                 Used Burger            missing
3                                    New Bike            missing
4                               New underwear            missing
5                                    New Sony               Sony
6                        Used Apple underwear              Apple
7                       Refurbished Panasonic          Panasonic
8                   Used Victoria's Secret TV  Victoria's Secret
9                                Disney phone             Disney
10                                Used laptop            missing
...                                       ...                ...
1999990             Refurbished Disney tablet             Disney
1999991                    Refurbished laptop            missing
1999992                       Nintendo Coffee           Nintendo
1999993                      Nintendo desktop           Nintendo
1999994         Refurbished Victoria's Secret  Victoria's Secret
1999995                           Used Burger            missing
1999996                    Nintendo underwear           Nintendo
1999997                     Refurbished Apple              Apple
1999998                      Refurbished Sony               Sony
1999999                      New Google phone             Google

[2000000 rows x 2 columns]
Took 3.2660000324249268s to process

My Machine's specs:

Windows 7 Pro 64bit Intel i7-4770 @ 3.40GHZ 12.0 GB RAM

3.266 seconds is pretty fast... right?

Upvotes: 2

John R

Reputation: 1508

With multi-word brands needing to be recognized, this is a NER (named entity recognition) task.

You need to split the words in item_title into chunks of up to n words (word n-grams), e.g. for 'New Victoria\'s Secret' with n = 3:

['New', 'New Victoria\'s', 'New Victoria\'s Secret', 'Victoria\'s', 'Victoria\'s Secret', 'Secret']

and then check those chunks against your list of brands.

If you anticipate spelling errors, build a trigram index over your list of brands, then break the item_title chunks into trigrams and score them against that index. Alternatively, you can use Levenshtein distance on the chunks with a tolerance of a few edit steps to prevent really bad matches.
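A minimal sketch of the exact-match chunking step (the helper name `word_ngrams` is my own, not from the answer; the fuzzy trigram/Levenshtein scoring is left out):

```python
def word_ngrams(title, n):
    """Return every contiguous run of 1..n words from the title."""
    words = title.split()
    return [' '.join(words[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(words) - k + 1)]

top_brands = {'Coca Cola', 'Apple', "Victoria's Secret", 'Samsung'}

def find_brand(title, n=3):
    # Check each chunk against the brand set; exact matches only.
    for chunk in word_ngrams(title, n):
        if chunk in top_brands:
            return chunk
    return 'missing'
```

Using a set for the brands makes each chunk lookup O(1), so the cost per title is proportional to the number of chunks rather than the number of brands.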

Upvotes: 0

cmaher

Reputation: 5215

It seems to me that something like this could work:

top_brands = [r'Coca Cola', r'Apple', r'Victoria\'s Secret', r'Samsung']

df = pd.DataFrame({
         'item_title': ['Apple 6S', 'New Victoria\'s Secret', 'Used Samsung TV', 'Used bike'],
         'brand_name': ['Apple', 'missing', 'missing', 'missing']
         }, columns=['item_title' ,'brand_name'])

#               item_title brand_name
# 0               Apple 6S      Apple
# 1  New Victoria's Secret    missing
# 2        Used Samsung TV    missing
# 3              Used bike    missing

# concatenate brand names into regex string
# with each brand as a capture group
top_brands = '|'.join(['(' + x + ')'  for x in top_brands])

# "(Coca Cola)|(Apple)|(Victoria\\'s Secret)|(Samsung)"

df.loc[:, 'brand_name'] = (df['item_title'].str.extract(top_brands)
                           .fillna('').sum(axis=1).replace('', 'missing'))

#               item_title         brand_name
# 0               Apple 6S              Apple
# 1  New Victoria's Secret  Victoria's Secret
# 2        Used Samsung TV            Samsung
# 3              Used bike            missing

Upvotes: 0
