user1217169
user1217169

Reputation: 439

Lookup in dictonary based on wildcard in pandas dataframe

I can't really find any good solution to this from stackoverflow. I managed to get full word match based on iterating over the list, and assigning the value to a new column. I guess df.lookup could also be used in some way to achieve the same, as well as df.get_value and probably also a join...

This is the solution I got working first. This is for looking up full words, but what is the best way to lookup based on wildcard and take the first result? An example could be having a list of bank transactions, trying to match against names of stores to assign a category.

import pandas as pd
df = pd.read_csv("transactions.csv")

d = {
    'SUBWAY': '9',
    'TRANSFER TO': '5',
    'Best Buy': '8'
}

for index, row in df.iterrows():
    if(row['Description'] in d.keys()):
        df.at[index, 'Category'] = d[row['Description']]

Upvotes: 2

Views: 928

Answers (1)

jpp
jpp

Reputation: 164773

Partial match

For partial matches of dictionary keys within df['Description'], you can iterate your dictionary instead of your dataframe:

d = {'IKEA': '9', 'TRANSFER TO SAVINGS': '5', 'Best buy': '8'}
for k, v in d.items():
    df.loc[df['Description'].str.contains(k), 'Category'] = v

Assuming you have a large number of rows relative to dictionary keys, this will be considerably more efficient than iterating rows.

Be careful with dictionary ordering. Pre v3.7, dictionaries should not be assumed to be ordered. In CPython v3.6 and v3.7+, dictionaries are insertion ordered. The above logic will ensure the last match sticks.

Exact match

For exact matches, you can use pd.Series.map followed by fillna:

d = {'IKEA': '9', 'TRANSFER TO SAVINGS': '5', 'Best buy': '8'}
df['Category'] = df['Description'].map(d).fillna(df['Category'])

The fillna is important as you don't want to overwrite existing values in 'Category' with NaN for instances where 'Description' is not in d.

A usually less efficient alternative which avoids fillna is possible via pd.Series.replace.

Upvotes: 5

Related Questions