Novice
Novice

Reputation: 1161

Modify pandas data frame values based on condition

Given a data frame with 9,000+ records (rows), a column with unique IDs, and each ID can have multiple records as shown below.

df.head(4)

Unique_ID Record_1 Record_2
AN5001 90.0 ten
AN5002 90.0 five
AN5001 95.0 five
AN5003 60.0 forty

There are 360 unique IDs. However, about half of them need to be corrected. Consider below df_corrected_ID.head(3)

Unique_ID_old Unique_ID_new
AN5001 AN5010
AN5002 AN5002
AN5003 AN5011

How would you, most efficiently, fix the Unique ID in the main df with 9,000+ records using the df_corrected_ID data frame?

So, check if Unique_ID_old is present in df['Unique_ID'] column and if it is, replace it with Unique_ID_new from df_corrected_ID.

How would one then check change occurred correctly (for example, just show the difference between the Unique_IDs -- say after converting the original and updated columns to lists and then, list(set(Unique_ID) - set(Unique_ID_new)).

It's okay to add another new column to original df if needed with corrected IDs, as long as the order is maintained an none of the records are changed.

Thanks!

Upvotes: 0

Views: 373

Answers (2)

crashMOGWAI
crashMOGWAI

Reputation: 692

It is simple mapping. You can add a column or update in place.

df_map = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5003'], uni_update=['AN5010','AN5002','AN5011'])) 
df_record = pd.DataFrame(dict(uni=['AN5001','AN5002','AN5001','AN5003'], rec1=[1,2,3,4], rec2=['a','b','c','d']))
    
df_record['uni_update'] = df_record.uni.map(dict(zip(df_map.uni, df_map.uni_update)))

df w/ new col

|    | uni    |   rec1 | rec2   | uni_update   |
|---:|:-------|-------:|:-------|:-------------|
|  0 | AN5001 |      1 | a      | AN5010       |
|  1 | AN5002 |      2 | b      | AN5002       |
|  2 | AN5001 |      3 | c      | AN5010       |
|  3 | AN5003 |      4 | d      | AN5011       |

to check afterwards

df_record_reduced = df_record.drop_duplicates(subset='uni').loc[:,['uni','uni_update']].reset_index(drop=True)

|    | uni    | uni_update   |
|---:|:-------|:-------------|
|  0 | AN5001 | AN5010       |
|  1 | AN5002 | AN5002       |
|  2 | AN5003 | AN5011       |

df_map.equals(df_record_reduced) #should be true if everything is aligned

Upvotes: 2

jprebys
jprebys

Reputation: 2516

You can use Series.replace for this. Make your corrected ID DataFrame into a dictionary that maps old IDs to new ones, and you can use it to update your existing column or create a new one.

update_dict = dict(zip(df_corrected_ID['Unique_ID_old'], df_corrected_ID['Unique_ID_new'])

# to create a new column
df['Unique_ID_updated'] = df['Unique_ID'].replace(update_dict)

# to update the existing column
df['Unique_ID'] = df['Unique_ID'].replace(update_dict)

Upvotes: 1

Related Questions