I have a dataframe that may or may not contain values in the str_id column (note the index on this dataframe). I need to check this dataframe against another dataframe containing cached data. If there is a single matching row in the cache (based on match_val), I need to copy that row's str_id back into the original dataframe's column without destroying any existing data in the column. If there is more than one matching row, it should add an error to the original dataframe. These requirements exist because, later on, I'll be adding extra functionality to do further cache checks and matches on other columns (e.g. name) if the first cache check returns nothing.
DF1
str_id match_val name
id
3 None 12345 foo
6 None 67890 bar
9 None None delta
12 existing None leave
CACHE
str_id match_val name
0 abcde 12345 alpha
1 edcba 12345 beta
2 ghij 67890 gamma
3 uofd 11111 delta
4 kfsl 11111 epsilon
5 xyz None theta
The desired result should be,
DF1
str_id match_val name error
id
3 None 12345 foo duplicate
6 ghij 67890 bar None
9 None None delta None
12 existing None leave None
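For reference, the sample frames can be reconstructed like this (dtypes are an assumption from the printouts; match_val is treated as strings here, but integers would work the same way):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"str_id": [None, None, None, "existing"],
     "match_val": ["12345", "67890", None, None],
     "name": ["foo", "bar", "delta", "leave"]},
    index=pd.Index([3, 6, 9, 12], name="id"),
)
df_cache = pd.DataFrame(
    {"str_id": ["abcde", "edcba", "ghij", "uofd", "kfsl", "xyz"],
     "match_val": ["12345", "12345", "67890", "11111", "11111", None],
     "name": ["alpha", "beta", "gamma", "delta", "epsilon", "theta"]},
)
```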
I have the following code so far, which correctly identifies single and multiple matches:
import pandas

df1["legacy_id"] = df1.index
merged = pandas.merge(
    left=df1[df1['match_val'].notnull()],
    right=df_cache[df_cache['match_val'].notnull()],
    on='match_val', how='left', suffixes=['_df', '_cache'],
)
merge_count = merged.groupby('match_val').size()
errors = merged[merged['match_val'].isin(merge_count[merge_count>1].index)][['legacy_id']]
errors['error'] = 'Duplicate match_val'
errors.set_index('legacy_id', inplace=True)
errors = errors[~errors.index.duplicated()]
matches = merged[merged['match_val'].isin(merge_count[merge_count==1].index)][['legacy_id', 'match_val', 'str_id_cache']]
matches.set_index('legacy_id', inplace=True)
But I can't figure out how to incorporate the data correctly back into the original dataframe. If I assign the columns, it destroys the data in the row with the existing str_id (see below), and I assume that in subsequent cache checks the same would happen to both the str_id and error columns. So how can I assign the data only for rows whose indexes match? Or is there a completely different way I should be attempting this?
df1['str_id'] = matches['str_id_cache']
df1['error'] = errors['error']
DF1
str_id match_val name legacy_id error
id
3 NaN 12345 foo 3 Duplicate match_val
6 ghij 67890 bar 6 NaN
9 NaN None delta 9 NaN
12 NaN None leave 12 NaN
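The destructive behaviour comes from index alignment: assigning a Series to a column reindexes it to df1's index, filling every label absent from the Series with NaN. A minimal illustration with hypothetical two-row frames (one way to restrict the assignment is .loc over the Series' own index):

```python
import pandas as pd

df = pd.DataFrame({"str_id": ["existing", None]}, index=[12, 6])
patch = pd.Series({6: "ghij"}, name="str_id")

# Whole-column assignment reindexes `patch` to df.index: label 12 is absent
# from `patch`, so it becomes NaN and the "existing" value is lost.
df["str_id"] = patch
assert pd.isna(df.loc[12, "str_id"])

# Restricting the assignment with .loc touches only the matching labels.
df.loc[12, "str_id"] = "existing"          # restore for the demo
df.loc[patch.index, "str_id"] = patch
assert df.loc[12, "str_id"] == "existing"
```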
Upvotes: 1
Views: 873
You can use .update() to make changes where the indexes match. It also matches on column names, so you'd need to .rename() the cache column first:
df1.update(matches.rename(columns={'str_id_cache': 'str_id'}))
>>> df1
str_id match_val name
id
3 None 12345 foo
6 ghij 67890 bar
9 None None delta
12 existing None leave
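The key property of .update() is that NaN/None values in the incoming frame never overwrite existing data, which is exactly what keeps the "existing" row intact. A tiny sketch with made-up frames:

```python
import pandas as pd

df = pd.DataFrame({"a": ["keep", "old"]}, index=[1, 2])
patch = pd.DataFrame({"a": [None, "new"]}, index=[1, 2])

# Aligns on index and column labels; None/NaN cells in `patch` are skipped,
# so "keep" survives while "old" is replaced.
df.update(patch)
```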
You could also set it up so the error and the matches are applied together in a single update. Start by adding the duplicate information to the cache (df2 here is the cache frame):
df2.loc[ df2.match_val.duplicated(keep=False), 'error' ] = 'duplicated'
>>> df2
str_id match_val name error
0 abcde 12345 alpha duplicated
1 edcba 12345 beta duplicated
2 ghij 67890 gamma NaN
3 uofd 11111 delta duplicated
4 kfsl 11111 epsilon duplicated
5 xyz None theta NaN
Full example:
df1['error'] = None
df2.loc[ df2.match_val.duplicated(keep=False), 'error' ] = 'duplicated'
new = df2.drop(columns='name').dropna(subset=['match_val']).drop_duplicates('match_val')
>>> new
str_id match_val error
0 abcde 12345 duplicated
2 ghij 67890 NaN
3 uofd 11111 duplicated
new = new.merge(df1.reset_index(), on='match_val', suffixes=['', '_df']).set_index('id')
>>> new
str_id match_val error str_id_df name error_df
id
3 abcde 12345 duplicated None foo None
6 ghij 67890 NaN None bar None
# if a row is duplicated we want str_id_df as str_id
# new.str_id = numpy.where(new.error == 'duplicated', new.str_id_df, new.str_id)
# new.str_id.mask(new.error == 'duplicated', new.str_id_df, inplace=True)
# new.str_id = new.str_id.mask(new.error == 'duplicated', new.str_id_df)
rows = new.error == 'duplicated'
new.loc[ rows, 'str_id' ] = new.loc[ rows, 'str_id_df' ]
>>> new
str_id match_val error str_id_df name error_df
id
3 None 12345 duplicated None foo None
6 ghij 67890 NaN None bar None
# the extra columns are ignored by update so no need to rename/remove
df1.update(new)
>>> df1
str_id match_val name error
id
3 None 12345 foo duplicated
6 ghij 67890 bar None
9 None None delta None
12 existing None leave None
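Putting the steps above together into one runnable script (the frame construction is assumed from the printouts; match_val is built as strings here):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"str_id": [None, None, None, "existing"],
     "match_val": ["12345", "67890", None, None],
     "name": ["foo", "bar", "delta", "leave"]},
    index=pd.Index([3, 6, 9, 12], name="id"),
)
df2 = pd.DataFrame(
    {"str_id": ["abcde", "edcba", "ghij", "uofd", "kfsl", "xyz"],
     "match_val": ["12345", "12345", "67890", "11111", "11111", None],
     "name": ["alpha", "beta", "gamma", "delta", "epsilon", "theta"]},
)

# Flag duplicated match_vals in the cache, then build the patch frame.
df1["error"] = None
df2.loc[df2.match_val.duplicated(keep=False), "error"] = "duplicated"

new = (df2.drop(columns="name")
          .dropna(subset=["match_val"])
          .drop_duplicates("match_val")
          .merge(df1.reset_index(), on="match_val", suffixes=["", "_df"])
          .set_index("id"))

# For duplicated rows, keep df1's original str_id instead of the cache's.
rows = new.error == "duplicated"
new.loc[rows, "str_id"] = new.loc[rows, "str_id_df"]

# Extra columns (str_id_df, error_df) are ignored by update.
df1.update(new)
```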
Upvotes: 2