Reputation: 44
I am facing some hard challenge in data cleaning using pandas. I need to work on the ZIP, State and City columns in the dataframe (as attached).
This is the glimpse of the data:
City State ZIP
0 PEARLAND TX 77584
1 HOT SPRINGS AR 7191
2 PALMDALE CA 93551
3 JONESBOROY AR 72401
4 WATSONVILLE CA 95076
5 SACRAMENTO UCA 95823
6 AGOURAT HILLS CA 91301
7 JSAN DIEGO CA 92114
8 NEW PORT RICHEY FL 34655
9 BURBANK CA 91501
Challenge I am facing: The Zip column should contain only the Zip number, The state column should have only 2 digit state code, and city should have valid city name.
For example:
City State ZIP
6 AGOURAT HILLS CA 91301
7 JSAN DIEGO CA 92114
8 NEW PORT RICHEY FL 34655
should be -
City State ZIP
6 AGOURAT HILLS CA 91301
7 JSAN DIEGO CA 92114
8 NEW PORT RICHEY FL 34655
So can anyone help me how can I achieve this using Pandas dataframe?
Upvotes: 1
Views: 254
Reputation: 38415
The best option would be to read the data once again using regex. If not, redo.
df['combined'] = df.apply(' '.join, axis = 1)
df[['City', 'State', 'Zip']] = df['combined'].str.extract('([A-Z ]+)\s+([A-Z]{2})\s+(\d+)', expand = True)
df.drop('combined', axis = 1, inplace = True)
City State ZIP
0 PEARLAND TX 77584
1 HOT SPRINGS AR 7191
2 PALMDALE CA 93551
3 JONESBOROY AR 72401
4 WATSONVILLE CA 95076
5 NaN NaN NaN
6 AGOURAT HILLS CA 91301
7 JSAN DIEGO CA 92114
8 NEW PORT RICHEY FL 34655
9 BURBANK CA 91501
Upvotes: 1