Reputation: 99
I have been trying to replace part of the texts in a Pandas dataframe column with keys from a dictionary based on multiple values; though I have achieved the desired result, the process or loop is very very slow in large dataset. I would appreciate it if someone could advise me of a more 'Pythonic' way or more efficient way of achieving the result. Pls see below example:
df = pd.DataFrame({'Dish': ['A', 'B','C'],
'Price': [15,8,20],
'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
})
Dish | Price | Ingredient |
---|---|---|
A | 15 | apple banana apricot lamb |
B | 8 | wheat pork venison |
C | 20 | orange lamb guinea |
The dictionary is below:
CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
'CEREAL': [['oat'], ['wheat'], ['corn']],
'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}
I am trying to replace text in the 'Ingredient' column with key based on dictionary values. For example, 'apple' in the first row wound be replaced by dictionary key: 'FRUIT'.. The desired table is shown below:
Dish | Price | Ingredient |
---|---|---|
A | 15 | FRUIT FRUIT apricot MEAT |
B | 8 | CEREAL MEAT venison |
C | 20 | FRUIT MEAT guinea |
I have seen some related queries here where each key has one value; but in this case, there are multiple values for any given key in the dictionary. So far, I have been able to achieve the desired result but it is painfully slow when working with a large dataset. The code I have used so far to achieve the result is shown below:
countries = list(CountryList.keys())
for country in countries:
for i in range(len(CountryList[country])):
lender = CountryList[country][i]
country = str(country)
lender = str(lender).replace("['",'',).replace("']",'')
df['Ingredient'] = df['Ingredient'].str.replace(lender,country)
Perhaps this could do with multiprocessing? Needless to say, my knowledge of Python needs a lot to be desired.
Any suggestion to speed up the process would be highly appreciated.
Thanking in advance,
Edit: just to add, some keys have more than 60000 values in the dictionary; and about 200 keys in the dictionary, which is making the code very inefficient time-wise.
Upvotes: 3
Views: 1284
Reputation: 18416
If you want to use regex, just join all the values in the CountryList
by pipe |
for each of the keys, and then call Series.str.replace
for each of the keys, it will be a way faster than the way you are trying.
joined={key: '|'.join(item[0] for item in value) for key,value in CountryList.items()}
for key in joined:
df['Ingredient'] = df['Ingredient'].str.replace(joined[key], key, regex=True)
OUTPUT:
Dish Price Ingredient
0 A 15 FRUIT FRUIT apricot MEAT
1 B 8 CEREAL MEAT venison
2 C 20 FRUIT MEAT guinea
Another approach would be to reverse the key,value in the dictionary, and then to use dict.get
for each key
with default value as key
splitting the words in Ingredient
column:
reversedContries={item[0]:key for key,value in CountryList.items() for item in value}
df['Ingredient'].apply(lambda x: ' '.join(reversedContries.get(y,y) for y in x.split()))
Upvotes: 0
Reputation: 77337
You can build a reverse index of product to type, by creating a dictionary where the keys are the values of the sublists
product_to_type = {}
for typ, product_lists in CountryList.items():
for product_list in product_lists:
for product in product_list:
product_to_type[product] = typ
A little python magic lets you compress this step into a generator that creates the dict
product_to_type = {product:typ for typ, product_lists in CountryList.items()
for product_list in product_lists for product in product_list}
Then you can create a function that splits the ingredients and maps them to type and apply that to the dataframe.
import pandas as pd
CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
'CEREAL': [['oat'], ['wheat'], ['corn']],
'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}
product_to_type = {product:typ for typ, product_lists in CountryList.items()
for product_list in product_lists for product in product_list}
def convert_product_to_type(products):
return " ".join(product_to_type.get(product, product)
for product in products.split(" "))
df = pd.DataFrame({'Dish': ['A', 'B','C'],
'Price': [15,8,20],
'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
})
df["Ingredient"] = df["Ingredient"].apply(convert_product_to_type)
print(df)
Note: This solution splits the ingredient list on word boundaries which assumes that ingredients themselves don't have spaces in them.
Upvotes: 2
Reputation: 120409
Change the format of CountryList:
import itertools
CountryList2 = {}
for k, v in CountryList.items():
for i in (itertools.chain.from_iterable(v)):
CountryList2[i] = k
>>> CountryList2
{'apple': 'FRUIT',
'orange': 'FRUIT',
'banana': 'FRUIT',
'oat': 'CEREAL',
'wheat': 'CEREAL',
'corn': 'CEREAL',
'chicken': 'MEAT',
'lamb': 'MEAT',
'pork': 'MEAT',
'turkey': 'MEAT',
'duck': 'MEAT'}
Now you can use replace
:
df['Ingredient'] = df['Ingredient'].replace(CountryList2, regex=True)
>>> df
Dish Price Ingredient
0 A 15 FRUIT FRUIT apricot MEAT
1 B 8 CEREAL MEAT venison
2 C 20 FRUIT MEAT guinea
Upvotes: 4