wolfblitza
wolfblitza

Reputation: 477

Python - Cleaning US and Canadian Zip Codes with `df.loc` and `str` Methods

I have the following code to create a column with cleaned up zip codes for the USA and Canada

df = pd.read_csv(file1)
usa = df['Region'] == 'USA'
canada = df['Region'] == 'Canada'
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-','') 

The issue that i am having is that some of the rows that have "USA" as the country contain Canadian postal codes in the dataset. So the USA logic from above is being applied to Canadian postal codes.

I tried the edited code above along with the below and experimented with one province ("BC") to prevent the USA logic from being applied in this case but it didn't work

usa = df['Region'] == 'USA'
usa = df['State'] != 'BC'

Region  Date    State   Zip         Customer     Revenue
USA     1/3/2014    BC  A5Z 1B6     Customer A   $157.52 
Canada  1/13/2014   AB  Z3J-4E5     Customer B   $750.00 
USA     1/4/2014    FL  90210-9999  Customer C   $650.75 
USA     1/21/2014   FL  12345       Customer D   $242.00 
USA     1/25/2014   FL  45678       Customer E   $15.00 
USA     1/28/2014   NY  91011       Customer F   $25.00 

Upvotes: 1

Views: 1164

Answers (2)

Kris
Kris

Reputation: 420

Give this a try:

# sample df provided by OP
>>> df
   Region        Date State        Zip    Customer Revenue
0     USA  2014-01-03    BC    A5Z 1B6  Customer A  157.52
1  Canada  2014-01-13    AB    Z3J-4E5  Customer B     750
2     USA  2014-01-04    FL  90210-999  Customer C  650.75
3     USA  2014-01-21    FL      12345  Customer D     242
4     USA  2014-01-25    FL      45678  Customer E      15
5     USA  2014-01-28    NY      91011  Customer F      25

# Edit 'Region' by testing 'Zip' for presence of letters (US Zip Codes are only numeric)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'Region'] = 'Canada'
>>> df
   Region        Date State        Zip    Customer Revenue
0  Canada  2014-01-03    BC    A5Z 1B6  Customer A  157.52
1  Canada  2014-01-13    AB    Z3J-4E5  Customer B     750
2     USA  2014-01-04    FL  90210-999  Customer C  650.75
3     USA  2014-01-21    FL      12345  Customer D     242
4     USA  2014-01-25    FL      45678  Customer E      15
5     USA  2014-01-28    NY      91011  Customer F      25

# apply OP's original filtering and cleaning
>>> usa = df['Region'] == 'USA'
>>> canada = df['Region'] == 'Canada'
>>> df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
>>> df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-','')

# display resultant df
>>> df
   Region        Date State        Zip    Customer Revenue ZipCleaned
0  Canada  2014-01-03    BC    A5Z 1B6  Customer A  157.52     A5Z1B6
1  Canada  2014-01-13    AB    Z3J-4E5  Customer B     750     Z3J4E5
2     USA  2014-01-04    FL  90210-999  Customer C  650.75      90210
3     USA  2014-01-21    FL      12345  Customer D     242      12345
4     USA  2014-01-25    FL      45678  Customer E      15      45678
5     USA  2014-01-28    NY      91011  Customer F      25      91011

EDIT: Update as requested by OP: we can do the following to leave the original 'Region' intact:

>>> df
   Region        Date State        Zip    Customer Revenue
0     USA  2014-01-03    BC    A5Z 1B6  Customer A  157.52
1  Canada  2014-01-13    AB    Z3J-4E5  Customer B     750
2     USA  2014-01-04    FL  90210-999  Customer C  650.75
3     USA  2014-01-21    FL     123456  Customer D     242
4     USA  2014-01-25    FL      45678  Customer E      15
5     USA  2014-01-28    NY      91011  Customer F      25

# create 'ZipCleaned' by referencing original 'Zip'
>>> df.loc[~df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.slice(stop=5)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.replace(' |-', '')

# Resultant df
>>> df
   Region        Date State        Zip    Customer Revenue ZipCleaned
0     USA  2014-01-03    BC    A5Z 1B6  Customer A  157.52     A5Z1B6
1  Canada  2014-01-13    AB    Z3J-4E5  Customer B     750     Z3J4E5
2     USA  2014-01-04    FL  90210-999  Customer C  650.75      90210
3     USA  2014-01-21    FL     123456  Customer D     242      12345
4     USA  2014-01-25    FL      45678  Customer E      15      45678
5     USA  2014-01-28    NY      91011  Customer F      25      91011
>>>

Upvotes: 0

wolfblitza
wolfblitza

Reputation: 477

Thanks Kris. But what if I wanted to maintain the original values in the Region column and change "Zip Cleaned" based on whether Zip contains a Canadian or USA Zip. I tried the following but it's not working

usa = df.loc[df['Ship To Customer Zip'].str.contains('[0-9]')]
canada = df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]')]

df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Ship To Customer Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Ship To Customer Zip'].str.replace(' |-','')

Upvotes: 1

Related Questions