jonboy
jonboy

Reputation: 374

Map dict keys to new column in pandas df

I am currently mapping dict keys to a column based values in a separate column. I'm using the values in Code to match the values in a dict and copying the keys to a separate column. So the new column will contain numbers 1,2,3.

This works fine except for the multiple code values at the same timestamp. I only want to have one mapped value for each unique timestamp.

If there are multiple value at the same timestamp but the mapped number will be the same (A,B), then just take the first value. I can use .drop_duplicates for this.

However, If the mapped numbers are different at the same time point, I want to drop 2 and select 1. .drop_duplicates will only work if 1 if listed before 2 but not vice versa

import pandas as pd
from fuzzywuzzy import process

df = pd.DataFrame({   
        'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
        'Code' : ['A','C','X','Y','A','B','X','A','Z','L'],                                   
        })

# Dictionary that contains how to map numbers
hdict = {'1' : ['A', 'B'],

    '2' : ['X','Y','Z'],

    '3' : ['D']}

def hColumn(df):

    # Construct a dataframe from the helper dictionary
    df1 = pd.DataFrame([*hdict.values()], index = hdict.keys()).T.melt().dropna()   
    # Get relevant matches using the library.
    m = df['Code'].apply(lambda x: process.extract(x, df1.value)[0])
    # Concat the matches with original df
    df2 = pd.concat([df, m[m.apply(lambda x: x[1]>80)].apply(lambda x: x[0])], axis=1)
    df2.columns = [*df.columns, 'matches']
    # After merge it with df1
    df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')
    # Drop columns that are not required and rename.
    df2 = df2.drop(['matches','value'],1).rename(columns={'variable':'H'})
    # Drop unwanted rows
    df2 = df2.mask(df2['H'].isna())
    df2 = df2.dropna(subset = ['H'])

    return df2

df = hColumn(df)

Intended Output:

                    Time Code  H
0  2019-08-02 09:50:10.1    A  1
1  2019-08-02 09:50:10.2    X  2
2  2019-08-02 09:50:10.3    Y  2
3  2019-08-02 09:50:10.4    A  1
4  2019-08-02 09:50:10.5    A  1
5  2019-08-02 09:50:10.6    Z  2

Upvotes: 1

Views: 1124

Answers (3)

jezrael
jezrael

Reputation: 863541

Use DataFrame.drop_duplicates:

df = df.drop_duplicates('Time')

If possible, here is another solution - it match values by Series.map:

#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in hdict.items() for k in oldv}
df["H"] = df['Code'].map(d)
df = df.dropna(subset=['H']).drop_duplicates('Time')
print (df)
                    Time Code  H
0  2019-08-02 09:50:10.1    A  1
2  2019-08-02 09:50:10.2    X  2
3  2019-08-02 09:50:10.3    Y  2
4  2019-08-02 09:50:10.4    A  1
6  2019-08-02 09:50:10.5    X  2
8  2019-08-02 09:50:10.6    Z  2

Upvotes: 1

qaiser
qaiser

Reputation: 2868

from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keywords_from_dict(hdict)

df['H'] = df['Code'].apply(lambda x : kp.extract_keywords(x))

df['H'] = df['H'].apply(lambda x: pd.Series(x[0]) if x else pd.Series())
df.dropna(inplace = True)
df

enter image description here

Upvotes: 1

Artiom  Kozyrev
Artiom Kozyrev

Reputation: 3856

If I were you I would "reverse" you dictionary, it will very simplify all next actions, my solution is as follows, please let me know if it helped and feel free to ask questions:

import pandas as pd

# start data:
df = pd.DataFrame({
        'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.1', '2019-08-02 09:50:10.2',
                 '2019-08-02 09:50:10.3', '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.4',
                 '2019-08-02 09:50:10.5', '2019-08-02 09:50:10.5',
                 '2019-08-02 09:50:10.6', '2019-08-02 09:50:10.6'],
        'Code': ['A', 'C', 'X', 'Y', 'A', 'B', 'X', 'A', 'Z', 'L'],
        })

#start data:
hdict = {'1': ['A', 'B'],
         '2': ['X', 'Y', 'Z'],
         '3': ['D']
         }

# "reverse dictionary"
reversed_hdict = {}
for k, v in hdict.items():
    for element in v:
        reversed_hdict[element] = k

print(reversed_hdict) # what dictionary we have now

# add new column by apply method:
df["new_column"] = df['Code'].apply(lambda x: reversed_hdict.get(x))

# check what we have now
print(df)

# print without None rows
print(df.dropna())

Upvotes: 1

Related Questions