Reputation: 33
I have a data frame containing transactions from a financial institution. One of the columns ['vendor_full'] is the vendor, but it could contain store numbers, physical location, etc. making it more difficult to summarize based who the vendor actually is.
I have created a dictionary where the key is the vendor name as it may appear in the data frame (or at least a part of the column string) and the value is the vendor name as I would like it to be written to a new column ['vendor_short'].
Based on this question and the answer from @Vaishali I got very close to the solution, but the difference is the user who posted the question above wanted to use the dictionary value as both the search term and the returned value. I would like to search for the key and return the value.
import pandas as pd
data = {'amount': [100, 150, 5, 89, 55, 14], 'vendor_full': ['store_name 1234', 'online_store xx55', 'st_name 9876', 'grocery_store', 'online_shop', 'clothing_store xx']}
cols = ['amount', 'vendor_full']
df = pd.DataFrame(data,columns = cols)
vendor_dict = {'store_name': 'store_name', 'online_store': 'online_store', 'st_name': 'store_name', 'grocery_store': 'grocery_store', 'online_shop': 'online_store', 'clothing_store': 'clothing_store'}
pat = r'({})'.format('|'.join(vendor_dict.values()))
cond = df['vendor_full'].str.contains('|'.join(vendor_dict.keys()))
df.loc[cond, 'vendor_short'] = df['vendor_full'].str.extract((pat), expand=False)
The code above seems to work for the first occurrence of a vendor, but I'm getting NaN for the remaining occurrences.
Actual:
amount vendor_full vendor_short
0 100 store_name 1234 store_name
1 150 online_store xx55 online_store
2 5 st_name 9876 NaN
3 89 grocery_store grocery_store
4 55 online_shop NaN
5 14 clothing_store xx clothing_store
Expected/desired:
amount vendor_full vendor_short
0 100 store_name 1234 store_name
1 150 online_store xx55 online_store
2 5 st_name 9876 store_name
3 89 grocery_store grocery_store
4 55 online_shop online_store
5 14 clothing_store xx clothing_store
Upvotes: 0
Views: 1864
Reputation: 42946
Firs we make dataframe out of your dict. Then we extract the names of your df
so we can merge on these names and get the vendor_short
:
df2 = pd.DataFrame({'vendor_full':list(vendor_dict.keys()),
'vendor_short':list(vendor_dict.values())})
s = df['vendor_full'].str.extract("({})".format('|'.join(df2['vendor_full'])))
df['vendor_short'] = s.merge(df2, left_on=0, right_on='vendor_full')['vendor_short']
amount vendor_full vendor_short
0 100 store_name 1234 store_name
1 150 online_store xx55 online_store
2 5 st_name 9876 store_name
3 89 grocery_store grocery_store
4 55 online_shop online_store
5 14 clothing_store xx clothing_store
Using .map
:
s = df['vendor_full'].str.extract("({})".format('|'.join(vendor_dict.keys())))
df['vendor_short'] = s[0].map(vendor_dict)
amount vendor_full vendor_short
0 100 store_name 1234 store_name
1 150 online_store xx55 online_store
2 5 st_name 9876 store_name
3 89 grocery_store grocery_store
4 55 online_shop online_store
5 14 clothing_store xx clothing_store
Provided by cs95 in the comments
Using regex to extract the names from the vendor_full
column and mapping them to the dict with .map
:
df['vendor_short'] = df['vendor_full'].str.extract('([a-zA-Z_]+)', expand=False).map(vendor_dict)
amount vendor_full vendor_short
0 100 store_name 1234 store_name
1 150 online_store xx55 online_store
2 5 st_name 9876 store_name
3 89 grocery_store grocery_store
4 55 online_shop online_store
5 14 clothing_store xx clothing_store
Upvotes: 1