Reputation: 477
I have the following code to create a column with cleaned up zip codes for the USA and Canada
df = pd.read_csv(file1)
usa = df['Region'] == 'USA'
canada = df['Region'] == 'Canada'
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-','')
The issue that i am having is that some of the rows that have "USA" as the country contain Canadian postal codes in the dataset. So the USA logic from above is being applied to Canadian postal codes.
I tried the edited code above along with the below and experimented with one province ("BC") to prevent the USA logic from being applied in this case but it didn't work
usa = df['Region'] == 'USA'
usa = df['State'] != 'BC'
Region Date State Zip Customer Revenue
USA 1/3/2014 BC A5Z 1B6 Customer A $157.52
Canada 1/13/2014 AB Z3J-4E5 Customer B $750.00
USA 1/4/2014 FL 90210-9999 Customer C $650.75
USA 1/21/2014 FL 12345 Customer D $242.00
USA 1/25/2014 FL 45678 Customer E $15.00
USA 1/28/2014 NY 91011 Customer F $25.00
Upvotes: 1
Views: 1164
Reputation: 420
Give this a try:
# sample df provided by OP
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# Edit 'Region' by testing 'Zip' for presence of letters (US Zip Codes are only numeric)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'Region'] = 'Canada'
>>> df
Region Date State Zip Customer Revenue
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 12345 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# apply OP's original filtering and cleaning
>>> usa = df['Region'] == 'USA'
>>> canada = df['Region'] == 'Canada'
>>> df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Zip'].str.slice(stop=5)
>>> df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Zip'].str.replace(' |-','')
# display resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 Canada 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 12345 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
EDIT: Update as requested by OP: we can do the following to leave the original 'Region' intact:
>>> df
Region Date State Zip Customer Revenue
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750
2 USA 2014-01-04 FL 90210-999 Customer C 650.75
3 USA 2014-01-21 FL 123456 Customer D 242
4 USA 2014-01-25 FL 45678 Customer E 15
5 USA 2014-01-28 NY 91011 Customer F 25
# create 'ZipCleaned' by referencing original 'Zip'
>>> df.loc[~df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.slice(stop=5)
>>> df.loc[df['Zip'].str.contains('[A-Za-z]'), 'ZipCleaned'] = df['Zip'].str.replace(' |-', '')
# Resultant df
>>> df
Region Date State Zip Customer Revenue ZipCleaned
0 USA 2014-01-03 BC A5Z 1B6 Customer A 157.52 A5Z1B6
1 Canada 2014-01-13 AB Z3J-4E5 Customer B 750 Z3J4E5
2 USA 2014-01-04 FL 90210-999 Customer C 650.75 90210
3 USA 2014-01-21 FL 123456 Customer D 242 12345
4 USA 2014-01-25 FL 45678 Customer E 15 45678
5 USA 2014-01-28 NY 91011 Customer F 25 91011
>>>
Upvotes: 0
Reputation: 477
Thanks Kris. But what if I wanted to maintain the original values in the Region column and change "Zip Cleaned" based on whether Zip contains a Canadian or USA Zip. I tried the following but it's not working
usa = df.loc[df['Ship To Customer Zip'].str.contains('[0-9]')]
canada = df.loc[df['Ship To Customer Zip'].str.contains('[A-Za-z]')]
df.loc[usa, 'ZipCleaned'] = df.loc[usa, 'Ship To Customer Zip'].str.slice(stop=5)
df.loc[canada, 'ZipCleaned'] = df.loc[canada, 'Ship To Customer Zip'].str.replace(' |-','')
Upvotes: 1