Reputation: 5532
The list top_brands contains brand names, such as:
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', ....]
items is a pandas.DataFrame with the structure shown below. My task is to fill brand_name from item_title whenever brand_name is missing:
row | item_title            | brand_name
1   | Apple 6S              | Apple
2   | New Victoria's Secret | missing  <-- need to fill with Victoria's Secret
3   | Used Samsung TV       | missing  <-- need to fill with Samsung
4   | Used bike             | missing  <-- nothing to do; there is no brand name in the title
....
My code is below. The problem is that it is too slow for a DataFrame with 2 million records. Is there any way I can use pandas or numpy to speed this up?
def get_brand_name(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or
                item_title.endswith(brand_end) or
                item_title.startswith(brand_start)):
            print(brand)
            return brand
    return 'missing'  ### end of get_brand_name

items['brand_name'] = items.apply(lambda x: get_brand_name(x), axis=1)
Upvotes: 5
Views: 1141
Reputation: 1000
Building a dataset with 2M data points:
import pandas as pd
import time
top_brands = ['Coca Cola', 'Apple', 'Victoria\'s Secret', 'Samsung']
items = pd.DataFrame(
    [['Apple 6S', 'Apple'],
     ['New Victoria\'s Secret', 'missing'],
     ['Used Samsung TV', 'missing'],
     ['Used bike', 'missing']],
    columns=['item_title', 'brand_name'])
items = pd.concat([items]*500000, ignore_index=True)
Timing your original code to use as a reference for comparison:
''' Code Block 1 '''
items1 = items.copy()
t = time.time()
def get_brand_name_v1(row):
    if row['brand_name'] != 'missing':
        return row['brand_name']
    item_title = row['item_title']
    for brand in top_brands:
        brand_start = brand + ' '
        brand_in_between = ' ' + brand + ' '
        brand_end = ' ' + brand
        if ((brand_in_between in item_title) or
                item_title.endswith(brand_end) or
                item_title.startswith(brand_start)):
            return brand
    return 'missing'
items1['brand_name'] = items1.apply(lambda x: get_brand_name_v1(x), axis=1)
print('Code Block 1 time: {:f}'.format(time.time()-t))
# Code Block 1 time: 53.718933
A modified version of your code:
Working with NaN values is usually faster than string comparison against 'missing'. Also, in my experience, storing a temporary reference to a value and reusing it (e.g. brand_name = row['brand_name'] instead of calling row['brand_name'] multiple times) is a bit faster than repeated lookups on the row.
''' Code Block 2 '''
items2 = items.copy()
t = time.time()
items2.loc[:,'brand_name'].replace(['missing'], [None], inplace=True)
def get_brand_name_v2(row):
    brand_name = row['brand_name']
    if brand_name is not None:
        return brand_name
    item_title = row['item_title']
    for brand in top_brands:
        if brand in item_title:
            return brand
items2['brand_name'] = items2.apply(lambda x: get_brand_name_v2(x), axis=1)
items2.loc[:,'brand_name'].fillna('missing', inplace=True)
print('Code Block 2 time: {:f}'.format(time.time()-t))
# Code Block 2 time: 47.940444
Inspired by Idlehands's answer:
This version does not discard the existing information in the brand_name column of the original data set; it fills only the missing values. You will gain speed this way but will use more memory.
''' Code Block 3 '''
items3 = items.copy()
items3.loc[:,'brand_name'].replace(['missing'], [None], inplace=True)
t = time.time()
brands = (items3['item_title'].str.extract(
    '(?P<brand_name>{})'.format("|".join(top_brands)), expand=True))
brands.loc[:,'brand_name'].fillna('missing', inplace=True)
items3.loc[:,'brand_name'].fillna(brands.loc[:,'brand_name'], inplace=True)
print('Code Block 3 time: {:f}'.format(time.time()-t))
# Code Block 3 time: 3.388266
You can make these even faster if you can afford to commit to using NaN instead of 'missing' in your dataset and remove all the operations that convert between NaN and 'missing'.
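For example, a NaN-native variant of Code Block 3 might look like the sketch below (not benchmarked here; it assumes brand_name already stores NaN/None for unknown values, so the initial replacement of 'missing' is only needed for this demo data):
''' Code Block 4 (sketch): NaN-native variant of Code Block 3 '''
items4 = items.copy()
# only needed because the demo data uses the string 'missing';
# skip this step if your dataset already stores NaN/None
items4['brand_name'] = items4['brand_name'].replace(['missing'], [None])

t = time.time()
brands = (items4['item_title'].str.extract(
    '(?P<brand_name>{})'.format("|".join(top_brands)), expand=True))
# fill only the rows whose brand_name is still missing,
# with no conversion back to the string 'missing' afterwards
items4['brand_name'] = items4['brand_name'].fillna(brands['brand_name'])
print('Code Block 4 time: {:f}'.format(time.time() - t))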
Upvotes: 0
Reputation: 13848
Try this:
pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
Output:
item_title brand_name
0 Apple 6S Apple
1 New Victoria's Secret Victoria's Secret
2 Used Samsung TV Samsung
3 Used Bike missing
I ran against a random sample of 2 million items on my machine:
import time
import pandas as pd

# assumes top_brands is already defined, as in the question
def read_file():
    df = pd.read_csv('file1.txt')
    new_df = pd.concat([df['item_title'], df['item_title'].str.extract('(?P<brand_name>{})'.format("|".join(top_brands)), expand=True).fillna('missing')], axis=1)
    return new_df

start = time.time()
print(read_file())
end = time.time() - start
print(f'Took {end}s to process')
Output:
item_title brand_name
0 LG watch LG
1 Sony watch Sony
2 Used Burger missing
3 New Bike missing
4 New underwear missing
5 New Sony Sony
6 Used Apple underwear Apple
7 Refurbished Panasonic Panasonic
8 Used Victoria's Secret TV Victoria's Secret
9 Disney phone Disney
10 Used laptop missing
... ... ...
1999990 Refurbished Disney tablet Disney
1999991 Refurbished laptop missing
1999992 Nintendo Coffee Nintendo
1999993 Nintendo desktop Nintendo
1999994 Refurbished Victoria's Secret Victoria's Secret
1999995 Used Burger missing
1999996 Nintendo underwear Nintendo
1999997 Refurbished Apple Apple
1999998 Refurbished Sony Sony
1999999 New Google phone Google
[2000000 rows x 2 columns]
Took 3.2660000324249268s to process
My Machine's specs:
Windows 7 Pro 64-bit, Intel i7-4770 @ 3.40GHz, 12.0 GB RAM
3.266 seconds is pretty fast... right?
Upvotes: 2
Reputation: 1508
Since multi-word brands need to be recognized, this is essentially a NER (named entity recognition) task.
You need to split the words in item_title into chunks of up to n words, e.g. ['New', 'New Victoria\'s', 'New Victoria\'s Secret', 'Victoria\'s', 'Victoria\'s Secret', 'Secret'], and then check those chunks against your list of brands.
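A minimal sketch of that chunk-and-match step (the ngrams and match_brand helper names are made up for illustration):
top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']
brand_set = set(top_brands)
max_n = max(len(brand.split()) for brand in top_brands)  # longest brand, in words

def ngrams(title, max_n):
    # all word chunks of length 1..max_n, in order of appearance
    words = title.split()
    return [' '.join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def match_brand(title):
    # return the first chunk that exactly matches a brand, else 'missing'
    for chunk in ngrams(title, max_n):
        if chunk in brand_set:
            return chunk
    return 'missing'

print(match_brand("New Victoria's Secret"))  # Victoria's Secret
print(match_brand('Used bike'))              # missing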
If you anticipate spelling errors, build a trigram index of your list of brands, then break the item_title chunks into trigrams and score them against that index. Alternatively, you can use Levenshtein distance on the chunks with a tolerance of n edits to reject really bad matches.
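A quick sketch of the Levenshtein variant (the tolerance of 2 edits is an arbitrary choice for illustration; it reuses the same word-chunking idea as above):
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_match_brand(title, brands, max_dist=2):
    # score every word chunk against every brand; keep the closest within tolerance
    words = title.split()
    max_n = max(len(brand.split()) for brand in brands)
    chunks = [' '.join(words[i:i + n])
              for n in range(1, max_n + 1)
              for i in range(len(words) - n + 1)]
    best_brand, best_dist = 'missing', max_dist + 1
    for chunk in chunks:
        for brand in brands:
            dist = levenshtein(chunk.lower(), brand.lower())
            if dist < best_dist:
                best_brand, best_dist = brand, dist
    return best_brand

top_brands = ['Coca Cola', 'Apple', "Victoria's Secret", 'Samsung']
print(fuzzy_match_brand('Used Samsnug TV', top_brands))  # Samsung (2 edits away)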
Upvotes: 0
Reputation: 5215
It seems to me that something like this could work:
top_brands = [r'Coca Cola', r'Apple', r'Victoria\'s Secret', r'Samsung']
df = pd.DataFrame({
    'item_title': ['Apple 6S', 'New Victoria\'s Secret', 'Used Samsung TV', 'Used bike'],
    'brand_name': ['Apple', 'missing', 'missing', 'missing']
}, columns=['item_title', 'brand_name'])
# item_title brand_name
# 0 Apple 6S Apple
# 1 New Victoria's Secret missing
# 2 Used Samsung TV missing
# 3 Used bike missing
# concatenate brand names into regex string
# with each brand as a capture group
top_brands = '|'.join(['(' + x + ')' for x in top_brands])
# "(Coca Cola)|(Apple)|(Victoria\\'s Secret)|(Samsung)"
df.loc[:, 'brand_name'] = df['item_title'].str.extract(
    top_brands).fillna('').sum(axis=1).replace('', 'missing')
# item_title brand_name
# 0 Apple 6S Apple
# 1 New Victoria's Secret Victoria's Secret
# 2 Used Samsung TV Samsung
# 3 Used bike missing
Upvotes: 0