richard

Reputation: 1585

Assign a column from another dataframe using the index in Pandas without destroying data in other rows

I have a dataframe containing some data that may or may not have values in the str_id column (note the index on this dataframe). I need to check this dataframe against another dataframe containing cached data. If there is a single matching row in the cache (based on match_val), I need to copy that row's str_id back into the original dataframe's str_id column without destroying any existing data in that column. If there is more than one matching row, it should add an error to the original dataframe instead. These requirements exist because, later on, I'll be adding further cache checks that match on other columns (e.g. name) if the first cache check returns nothing.

DF1
      str_id match_val   name
id
3       None     12345    foo
6       None     67890    bar
9       None      None  delta
12  existing      None  leave

CACHE
  str_id match_val     name
0  abcde     12345    alpha
1  edcba     12345     beta
2   ghij     67890    gamma
3   uofd     11111    delta
4   kfsl     11111  epsilon
5    xyz      None    theta
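For reference, a minimal setup that reproduces these frames (I'm assuming plain object dtypes here, with match_val held as strings):

import pandas

df1 = pandas.DataFrame(
    {'str_id': [None, None, None, 'existing'],
     'match_val': ['12345', '67890', None, None],
     'name': ['foo', 'bar', 'delta', 'leave']},
    index=pandas.Index([3, 6, 9, 12], name='id'))

df_cache = pandas.DataFrame(
    {'str_id': ['abcde', 'edcba', 'ghij', 'uofd', 'kfsl', 'xyz'],
     'match_val': ['12345', '12345', '67890', '11111', '11111', None],
     'name': ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'theta']})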

The desired result should be,

DF1
      str_id match_val   name       error
id
3       None     12345    foo   duplicate
6       ghij     67890    bar        None
9       None      None  delta        None
12  existing      None  leave        None

I have the following code so far, which correctly identifies single and multiple matches:

import pandas

# carry the original index through the merge
df1["legacy_id"] = df1.index

merged = pandas.merge(
    left=df1[df1['match_val'].notnull()],
    right=df_cache[df_cache['match_val'].notnull()],
    on='match_val', how='left', suffixes=['_df', '_cache'])
merge_count = merged.groupby('match_val').size()

# rows whose match_val matched more than one cache entry
errors = merged[merged['match_val'].isin(merge_count[merge_count > 1].index)][['legacy_id']]
errors['error'] = 'Duplicate match_val'
errors.set_index('legacy_id', inplace=True)
errors = errors[~errors.index.duplicated()]

# rows with exactly one cache hit
matches = merged[merged['match_val'].isin(merge_count[merge_count == 1].index)][['legacy_id', 'match_val', 'str_id_cache']]
matches.set_index('legacy_id', inplace=True)

But I can't figure out how to incorporate this data back into the original dataframe correctly. If I assign the columns directly, it destroys the data in the row that already has a str_id (see below), and I assume the same would happen to both the str_id and error columns on subsequent cache checks. So how can I assign the data only for rows whose indexes match? Or is there a completely different way I should be approaching this?

df1['str_id'] = matches['str_id_cache']
df1['error'] = errors['error']


DF1
   str_id match_val   name  legacy_id                error
id
3     NaN     12345    foo          3  Duplicate match_val
6    ghij     67890    bar          6                  NaN
9     NaN      None  delta          9                  NaN
12    NaN      None  leave         12                  NaN

Upvotes: 1

Views: 873

Answers (1)

user15398259

Reputation:

You can use .update() to make changes where the indexes match. It also matches on column names, so you'd need to .rename() the cache column first:

df1.update(matches.rename(columns={'str_id_cache': 'str_id'}))

>>> df1
      str_id match_val   name
id                           
3       None     12345    foo
6       ghij     67890    bar
9       None      None  delta
12  existing      None  leave
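.update() aligns on the index, ignores rows and columns absent from the argument, and never overwrites with missing values, which is why the existing str_id in row 12 survives. A minimal illustration of that alignment behaviour:

import pandas

a = pandas.DataFrame({'x': ['keep', None]}, index=[1, 2])
b = pandas.DataFrame({'x': ['new', None]}, index=[2, 3])
a.update(b)
# row 1 is untouched (not in b), row 2 becomes 'new',
# row 3 of b is ignored, and b's None values never overwrite anything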

You could also set it up so the errors and the matches are applied together in a single update.

You can add the duplicated information to the cache (df2 here, called df_cache in the question).

df2.loc[df2.match_val.duplicated(keep=False), 'error'] = 'duplicated'

>>> df2
  str_id match_val     name       error
0  abcde     12345    alpha  duplicated
1  edcba     12345     beta  duplicated
2   ghij     67890    gamma         NaN
3   uofd     11111    delta  duplicated
4   kfsl     11111  epsilon  duplicated
5    xyz      None    theta         NaN

Full example:

df1['error'] = None
df2.loc[df2.match_val.duplicated(keep=False), 'error'] = 'duplicated'
new = df2.drop(columns='name').dropna(subset=['match_val']).drop_duplicates('match_val')

>>> new
  str_id match_val       error
0  abcde     12345  duplicated
2   ghij     67890         NaN
3   uofd     11111  duplicated

new = new.merge(df1.reset_index(), on='match_val', suffixes=['', '_df']).set_index('id')

>>> new
   str_id match_val       error str_id_df name error_df
id                                                     
3   abcde     12345  duplicated      None  foo     None
6    ghij     67890         NaN      None  bar     None


# if a row is duplicated we want str_id_df as str_id
# new.str_id = numpy.where(new.error == 'duplicated', new.str_id_df, new.str_id)
# new.str_id.mask(new.error == 'duplicated', new.str_id_df, inplace=True)
# new.str_id = new.str_id.mask(new.error == 'duplicated', new.str_id_df)

rows = new.error == 'duplicated'
new.loc[rows, 'str_id'] = new.loc[rows, 'str_id_df']

>>> new
   str_id match_val       error str_id_df name error_df
id                                                     
3    None     12345  duplicated      None  foo     None
6    ghij     67890         NaN      None  bar     None

# the extra columns are ignored by update so no need to rename/remove
df1.update(new)

>>> df1
      str_id match_val   name       error
id                                       
3       None     12345    foo  duplicated
6       ghij     67890    bar        None
9       None      None  delta        None
12  existing      None  leave        None
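Since the question mentions running further cache checks on other columns later, the same steps could be wrapped in a helper and re-run per key column. A sketch under that assumption (the function name and signature are mine, not from the post):

def apply_cache_check(df, cache, on):
    # flag cache rows whose key appears more than once
    cache = cache.copy()
    cache['error'] = None
    cache.loc[cache[on].duplicated(keep=False), 'error'] = 'duplicated'
    if 'error' not in df.columns:
        df['error'] = None
    # one candidate per key, skipping cache rows with no key at all
    new = (cache[[on, 'str_id', 'error']]
           .dropna(subset=[on])
           .drop_duplicates(on)
           .merge(df.reset_index(), on=on, suffixes=['', '_df'])
           .set_index(df.index.name))
    # for ambiguous keys keep the original str_id, not the cache's
    ambiguous = new['error'] == 'duplicated'
    new.loc[ambiguous, 'str_id'] = new.loc[ambiguous, 'str_id_df']
    df.update(new)

apply_cache_check(df1, df_cache, on='match_val')
# a later pass could reuse it on another column, e.g. on='name'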

Upvotes: 2
