Reputation: 4292
I have a large dataframe (DF1) that contains a variable containing UK postcode data. Inevitably there are some typos in the data. However, after some work with regular expressions, I have created a second database that contains corrected versions of the postcode data (but only for those rows where the original postcode was incorrect) – DF2. (N.B. the index values are not necessarily consecutive.)
id postcode remark
0 1 L93AP Normal
2 2 LD38AH Normal
4 3 SO224ER Normal
6 4 SO21 Too short
8 5 DN379HJ Normal
10 6 M21ORH Zero replaced with O (oh)
12 7 NP745G S replaced with 5
14 8 SE136R2 Z replaced with 2
16 9 BN251ESBN Too long
18 10 TD152EH Normal
The dataframe containing the corrected data is:
0 1 2 3 pcCorrected
10 M21 0 R H M210RH
12 NP7 4 S G NP74SG
14 SE13 6 R Z SE136RZ
I want to combine the 2 databases such that the new values in the pcCorrected column in DF2 replace the old postcode values in the DF1 dataframe but, for other cells, the existing postcode values remain in tact. The final database should look like:
id postcode remark
0 1 L93AP Normal
2 2 LD38AH Normal
4 3 SO224ER Normal
6 4 SO21 Too short
8 5 DN379HJ Normal
10 6 M210RH Normal
12 7 NP74SG Normal
14 8 SE136RZ Normal
16 9 BN251ESBN Too long
18 10 TD152EH Normal
The databases are quite large (>1 million rows). Does this action have a name and what is the most efficient way to do this?
Upvotes: 1
Views: 1333
Reputation: 3436
You can replace the values of first dataframe df1
with the values of second dataframe df2
by doing the following steps.
Step 1: Set index of the first dataframe (df1)
df1.set_index('id')
Step 2: Set index of the second dataframe (df2)
df2.set_index('id')
and finally update the dataframe using the following snippet —
df1.update(df2)
Upvotes: 0
Reputation: 3751
maybe a simpler method would be:
mask = df1.index.isin(df2.index)
df1.loc[mask,'postcode'] = df2['pcCorrected']
this could avoid the additional step of merging the two DataFrames
Upvotes: 1
Reputation: 862471
You can try merge
by indexes
, create mask
by notnull
and add new values by loc
:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
mask = pd.notnull(df['pcCorrected'])
print mask
0 False
2 False
4 False
6 False
8 False
10 True
12 True
14 True
16 False
18 False
Name: pcCorrected, dtype: bool
df.loc[mask, 'remark'] = 'Normal'
df.loc[mask, 'postcode'] = df['pcCorrected']
print df[['id','postcode','remark']]
id postcode remark
0 1 L93AP Normal
2 2 LD38AH Normal
4 3 SO224ER Normal
6 4 SO21 Too short
8 5 DN379HJ Normal
10 6 M210RH Normal
12 7 NP74SG Normal
14 8 SE136RZ Normal
16 9 BN251ESBN Too long
18 10 TD152EH Normal
Upvotes: 1
Reputation: 4292
One way is to step through each row of the dataframe containing the corrected values and then transfer to the original dataframe:
for i in DF2.index:
DF1.ix[i,'postcode'] = DF2.ix[i,'pcCorrected']
Is there a more efficient way to do this?
Upvotes: 0