Reputation: 374
I am currently mapping dict keys to a column based values in a separate column. I'm using the values in Code
to match the values in a dict and copying the keys to a separate column. So the new column will contain numbers 1,2,3
.
This works fine except for the multiple code values at the same timestamp. I only want to have one mapped value for each unique timestamp.
If there are multiple value at the same timestamp but the mapped number will be the same (A,B)
, then just take the first value. I can use .drop_duplicates
for this.
However, If the mapped numbers are different at the same time point, I want to drop 2
and select 1
. .drop_duplicates
will only work if 1
if listed before 2
but not vice versa
import pandas as pd
from fuzzywuzzy import process
df = pd.DataFrame({
'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
'Code' : ['A','C','X','Y','A','B','X','A','Z','L'],
})
# Dictionary that contains how to map numbers
hdict = {'1' : ['A', 'B'],
'2' : ['X','Y','Z'],
'3' : ['D']}
def hColumn(df):
# Construct a dataframe from the helper dictionary
df1 = pd.DataFrame([*hdict.values()], index = hdict.keys()).T.melt().dropna()
# Get relevant matches using the library.
m = df['Code'].apply(lambda x: process.extract(x, df1.value)[0])
# Concat the matches with original df
df2 = pd.concat([df, m[m.apply(lambda x: x[1]>80)].apply(lambda x: x[0])], axis=1)
df2.columns = [*df.columns, 'matches']
# After merge it with df1
df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')
# Drop columns that are not required and rename.
df2 = df2.drop(['matches','value'],1).rename(columns={'variable':'H'})
# Drop unwanted rows
df2 = df2.mask(df2['H'].isna())
df2 = df2.dropna(subset = ['H'])
return df2
df = hColumn(df)
Intended Output:
Time Code H
0 2019-08-02 09:50:10.1 A 1
1 2019-08-02 09:50:10.2 X 2
2 2019-08-02 09:50:10.3 Y 2
3 2019-08-02 09:50:10.4 A 1
4 2019-08-02 09:50:10.5 A 1
5 2019-08-02 09:50:10.6 Z 2
Upvotes: 1
Views: 1124
Reputation: 863541
Use DataFrame.drop_duplicates
:
df = df.drop_duplicates('Time')
If possible, here is another solution - it match values by Series.map
:
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in hdict.items() for k in oldv}
df["H"] = df['Code'].map(d)
df = df.dropna(subset=['H']).drop_duplicates('Time')
print (df)
Time Code H
0 2019-08-02 09:50:10.1 A 1
2 2019-08-02 09:50:10.2 X 2
3 2019-08-02 09:50:10.3 Y 2
4 2019-08-02 09:50:10.4 A 1
6 2019-08-02 09:50:10.5 X 2
8 2019-08-02 09:50:10.6 Z 2
Upvotes: 1
Reputation: 2868
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keywords_from_dict(hdict)
df['H'] = df['Code'].apply(lambda x : kp.extract_keywords(x))
df['H'] = df['H'].apply(lambda x: pd.Series(x[0]) if x else pd.Series())
df.dropna(inplace = True)
df
Upvotes: 1
Reputation: 3856
If I were you I would "reverse" you dictionary, it will very simplify all next actions, my solution is as follows, please let me know if it helped and feel free to ask questions:
import pandas as pd
# start data:
df = pd.DataFrame({
'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.1', '2019-08-02 09:50:10.2',
'2019-08-02 09:50:10.3', '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.4',
'2019-08-02 09:50:10.5', '2019-08-02 09:50:10.5',
'2019-08-02 09:50:10.6', '2019-08-02 09:50:10.6'],
'Code': ['A', 'C', 'X', 'Y', 'A', 'B', 'X', 'A', 'Z', 'L'],
})
#start data:
hdict = {'1': ['A', 'B'],
'2': ['X', 'Y', 'Z'],
'3': ['D']
}
# "reverse dictionary"
reversed_hdict = {}
for k, v in hdict.items():
for element in v:
reversed_hdict[element] = k
print(reversed_hdict) # what dictionary we have now
# add new column by apply method:
df["new_column"] = df['Code'].apply(lambda x: reversed_hdict.get(x))
# check what we have now
print(df)
# print without None rows
print(df.dropna())
Upvotes: 1