Reputation: 1161
Given a data frame with 9,000+ records (rows), a column with unique IDs, and each ID can have multiple records as shown below.
df.head(4)
Unique_ID | Record_1 | Record_2 |
---|---|---|
AN5001 | 90.0 | ten |
AN5002 | 90.0 | five |
AN5001 | 95.0 | five |
AN5003 | 60.0 | forty |
There are 360 unique IDs. However, about half of them need to be corrected. Consider below df_corrected_ID.head(3)
Unique_ID_old | Unique_ID_new |
---|---|
AN5001 | AN5010 |
AN5002 | AN5002 |
AN5003 | AN5011 |
How would you, most efficiently, fix the Unique ID in the main df
with 9,000+ records using the df_corrected_ID
data frame?
So, check if Unique_ID_old is present in df['Unique_ID']
column and if it is, replace it with Unique_ID_new from df_corrected_ID
.
How would one then check change occurred correctly (for example, just show the difference between the Unique_IDs -- say after converting the original and updated columns to lists and then, list(set(Unique_ID) - set(Unique_ID_new))
.
It's okay to add another new column to original df
if needed with corrected IDs, as long as the order is maintained an none of the records are changed.
Thanks!
Upvotes: 0
Views: 373
Reputation: 692
It is simple mapping. You can add a column or update in place.
df_map = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5003'], uni_update=['AN5010','AN5002','AN5011']))
df_record = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5001','AN5003'], rec1=[1,2,3,4], rec2=['a','b','c','d']))
df_record['uni_update'] = df_record.uni.map(dict(zip(df_map.uni, df_map.uni_update)))
df w/ new col
| | uni | rec1 | rec2 | uni_update |
|---:|:-------|-------:|:-------|:-------------|
| 0 | AN5001 | 1 | a | AN5010 |
| 1 | AN5002 | 2 | b | AN5002 |
| 2 | AN5001 | 3 | c | AN5010 |
| 3 | AN5003 | 4 | d | AN5011 |
to check afterwards
df_record_reduced = df_record.drop_duplicates(subset='uni').loc[:,['uni','uni_update']].reset_index(drop=True)
| | uni | uni_update |
|---:|:-------|:-------------|
| 0 | AN5001 | AN5010 |
| 1 | AN5002 | AN5002 |
| 2 | AN5003 | AN5011 |
df_map.equals(df_record_reduced) #should be true if everything is aligned
Upvotes: 2
Reputation: 2516
You can use Series.replace for this. Make your corrected ID DataFrame into a dictionary that maps old IDs to new ones, and you can use it to update your existing column or create a new one.
update_dict = dict(zip(df_corrected_ID['Unique_ID_old'], df_corrected_ID['Unique_ID_new'])
# to create a new column
df['Unique_ID_updated'] = df['Unique_ID'].replace(update_dict)
# to update the existing column
df['Unique_ID'] = df['Unique_ID'].replace(update_dict)
Upvotes: 1