I have been trying to replace part of the texts in a Pandas dataframe column with keys from a dictionary based on multiple values; though I have achieved the desired result, the process or loop is very very slow in large dataset. I would appreciate it if someone could advise me of a more 'Pythonic' way or more efficient way of achieving the result. Pls see below example: df = pd.DataFrame({'Dish': ['A', 'B','C'], 'Price': [15,8,20], 'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea'] }) Dish Price Ingredient A 15 apple banana apricot lamb B 8 wheat pork venison C 20 orange lamb guinea The dictionary is below: CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']], 'CEREAL': [['oat'], ['wheat'], ['corn']], 'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]} I am trying to replace text in the 'Ingredient' column with key based on dictionary values. For example, 'apple' in the first row wound be replaced by dictionary key: 'FRUIT'.. The desired table is shown below: Dish Price Ingredient A 15 FRUIT FRUIT apricot MEAT B 8 CEREAL MEAT venison C 20 FRUIT MEAT guinea I have seen some related queries here where each key has one value; but in this case, there are multiple values for any given key in the dictionary. So far, I have been able to achieve the desired result but it is painfully slow when working with a large dataset. The code I have used so far to achieve the result is shown below: countries = list(CountryList.keys()) for country in countries: for i in range(len(CountryList[country])): lender = CountryList[country][i] country = str(country) lender = str(lender).replace("['",'',).replace("']",'') df['Ingredient'] = df['Ingredient'].str.replace(lender,country) Perhaps this could do with multiprocessing? Needless to say, my knowledge of Python needs a lot to be desired. Any suggestion to speed up the process would be highly appreciated. Thanking in advance, Edit: just to add, some keys have more than 60000 values in the dictionary; and about 200 keys in the dictionary, which is making the code very inefficient time-wise.

Reputation: 99

Replacing text with dictionary keys (having multiple values) in Python - more efficiency

I have been trying to replace part of the texts in a Pandas dataframe column with keys from a dictionary based on multiple values; though I have achieved the desired result, the process or loop is very very slow in large dataset. I would appreciate it if someone could advise me of a more 'Pythonic' way or more efficient way of achieving the result. Pls see below example:

df =  pd.DataFrame({'Dish':  ['A', 'B','C'],
        'Price': [15,8,20],
         'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
        })

Dish	Price	Ingredient
A	15	apple banana apricot lamb
B	8	wheat pork venison
C	20	orange lamb guinea

The dictionary is below:

CountryList = {'FRUIT': [['apple'], ['orange'],  ['banana']],
 'CEREAL': [['oat'], ['wheat'],  ['corn']],
 'MEAT': [['chicken'],  ['lamb'],  ['pork'],  ['turkey'], ['duck']]}

I am trying to replace text in the 'Ingredient' column with key based on dictionary values. For example, 'apple' in the first row wound be replaced by dictionary key: 'FRUIT'.. The desired table is shown below:

Dish	Price	Ingredient
A	15	FRUIT FRUIT apricot MEAT
B	8	CEREAL MEAT venison
C	20	FRUIT MEAT guinea

I have seen some related queries here where each key has one value; but in this case, there are multiple values for any given key in the dictionary. So far, I have been able to achieve the desired result but it is painfully slow when working with a large dataset. The code I have used so far to achieve the result is shown below:

countries = list(CountryList.keys())

for country in countries:
    for i in range(len(CountryList[country])):
        lender = CountryList[country][i]
        country = str(country)
        lender = str(lender).replace("['",'',).replace("']",'')
        df['Ingredient'] = df['Ingredient'].str.replace(lender,country)

Perhaps this could do with multiprocessing? Needless to say, my knowledge of Python needs a lot to be desired.

Any suggestion to speed up the process would be highly appreciated.

Thanking in advance,

Edit: just to add, some keys have more than 60000 values in the dictionary; and about 200 keys in the dictionary, which is making the code very inefficient time-wise.

Upvotes: 3

Answers (3)

ThePyGuy

Reputation: 18416

If you want to use regex, just join all the values in the CountryList by pipe | for each of the keys, and then call Series.str.replace for each of the keys, it will be a way faster than the way you are trying.

joined={key: '|'.join(item[0] for item in value) for key,value in CountryList.items()}

for key in joined:
    df['Ingredient'] = df['Ingredient'].str.replace(joined[key], key, regex=True)

OUTPUT:

  Dish  Price                 Ingredient
0    A     15  FRUIT FRUIT apricot MEAT 
1    B      8        CEREAL MEAT venison
2    C     20          FRUIT MEAT guinea

Another approach would be to reverse the key,value in the dictionary, and then to use dict.get for each key with default value as key splitting the words in Ingredient column:

reversedContries={item[0]:key for key,value in CountryList.items() for item in value}

df['Ingredient'].apply(lambda x: ' '.join(reversedContries.get(y,y) for y in x.split()))

Upvotes: 0

tdelaney

Reputation: 77337

You can build a reverse index of product to type, by creating a dictionary where the keys are the values of the sublists

product_to_type = {}
for typ, product_lists in CountryList.items():
    for product_list in product_lists:
        for product in product_list:
            product_to_type[product] = typ

A little python magic lets you compress this step into a generator that creates the dict

product_to_type = {product:typ for typ, product_lists in CountryList.items()
   for product_list in product_lists for product in product_list}

Then you can create a function that splits the ingredients and maps them to type and apply that to the dataframe.

import pandas as pd

CountryList = {'FRUIT': [['apple'], ['orange'],  ['banana']],
 'CEREAL': [['oat'], ['wheat'],  ['corn']],
 'MEAT': [['chicken'],  ['lamb'],  ['pork'],  ['turkey'], ['duck']]}

product_to_type = {product:typ for typ, product_lists in CountryList.items()
   for product_list in product_lists for product in product_list}

def convert_product_to_type(products):
    return " ".join(product_to_type.get(product, product) 
        for product in products.split(" "))
    
df =  pd.DataFrame({'Dish':  ['A', 'B','C'],
        'Price': [15,8,20],
         'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
        })

df["Ingredient"] = df["Ingredient"].apply(convert_product_to_type)

print(df)

Note: This solution splits the ingredient list on word boundaries which assumes that ingredients themselves don't have spaces in them.

Upvotes: 2

Corralien

Reputation: 120409

Change the format of CountryList:

import itertools

CountryList2 = {}
for k, v in CountryList.items():
    for i in (itertools.chain.from_iterable(v)):
        CountryList2[i] = k

>>> CountryList2
{'apple': 'FRUIT',
 'orange': 'FRUIT',
 'banana': 'FRUIT',
 'oat': 'CEREAL',
 'wheat': 'CEREAL',
 'corn': 'CEREAL',
 'chicken': 'MEAT',
 'lamb': 'MEAT',
 'pork': 'MEAT',
 'turkey': 'MEAT',
 'duck': 'MEAT'}

Now you can use replace:

df['Ingredient'] = df['Ingredient'].replace(CountryList2, regex=True)

>>> df
  Dish  Price                 Ingredient
0    A     15   FRUIT FRUIT apricot MEAT
1    B      8        CEREAL MEAT venison
2    C     20          FRUIT MEAT guinea

Upvotes: 4

Replacing text with dictionary keys (having multiple values) in Python - more efficiency

Answers (3)

Related Questions