Reputation: 2606
I'm attempting to clean up some of the Data that I have from an excel file. The file contains 7400 rows and 18 columns, which includes a list of customers with their respective addresses and other data. The problem that I'm encountering is that some of the cities are misspelled which distorts the information and makes it difficult for further processing.
SURNAME | ADDRESS | CITY
0 Jenson | 252 Des Chênes | D.DO
1 Jean | 236 Gouin | DOLLARD
2 Denis | 993 Boul. Gouin | DOLLARD-DES-ORMEAUX
3 Bradford | 1690 Dollard #7 | DDO
4 Alisson | 115 Du Buisson | IL PERROT
5 Abdul | 9877 Boul. Gouin | Pierrefonds
6 O'Neil | 5 Du College | Ile Bizard
7 Bundy | 7345 Sherbrooke | ILLE Perot
8 Darcy | 8671 Anthony #2 | ILE Perrot
9 Adams | 845 Georges | Pierrefonds
In the above example D.DO, DOLLARD, DDO should be spelled DOLLARD-DES-ORMEAUX and IL PERROT, ILLE PEROT, ILE PERROT should be spelled ILE-PERROT.
I've been able to replace the values using:
df["CITY"].replace(to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", regex=True)
df["CITY"].replace(to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT", regex=True)
Is there some way of combining the above operations into one? I've tried:
df["CITY"].replace({to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT"}, regex=True)
but I've had no luck
Upvotes: 14
Views: 28571
Reputation: 210832
try .replace({}, regex=True)
method:
replacements = {
'CITY': {
r'(D.*DO|DOLLARD.*)': 'DOLLARD-DES-ORMEAUX',
r'I[lL]*[eE]*.*': 'ILLE Perot'}
}
df.replace(replacements, regex=True, inplace=True)
print(df)
Output:
SURNAME ADDRESS CITY
0 Jenson 252 Des Chênes DOLLARD-DES-ORMEAUX
1 Jean 236 Gouin DOLLARD-DES-ORMEAUX
2 Denis 993 Boul. Gouin DOLLARD-DES-ORMEAUX
3 Bradford 1690 Dollard #7 DOLLARD-DES-ORMEAUX
4 Alisson 115 Du Buisson ILLE Perot
5 Abdul 9877 Boul. Gouin Pierrefonds
6 O'Neil 5 Du College ILLE Perot
7 Bundy 7345 Sherbrooke ILLE Perot
8 Darcy 8671 Anthony #2 ILLE Perot
9 Adams 845 Georges Pierrefonds
Upvotes: 26
Reputation: 109526
You can create a dictionary of replacements and then iterate through them, using 'loc' for replacement.
target_for_values = {
'DOLLARD-DES-ORMEAUX': ['D.DO', 'DOLLARD', 'DDO'],
'ILE-PERROT': ['IL PERROT', 'ILLE PEROT', 'ILE PERROT']}
for k, v in target_for_values.iteritems():
df.loc[df.CITY.str.upper().isin(v), 'CITY'] = k
>>> df.CITY
CITY
0 C.DO
1 DOLLARD-DES-ORMEAUX
2 DOLLARD-DES-ORMEAUX
3 DOLLARD-DES-ORMEAUX
4 ILE-PERROT
5 Pierrefonds
6 Ile Bizard
7 ILE-PERROT
8 ILE-PERROT
9 Pierrefonds
Upvotes: 4