Reputation: 149
I have a dataframe with full addresses in a column, and I need to create a separate column with just the postal code of 5 digits starting by 7 in the same dataframe. Some of the addresses may be empty or postal code not found.
How do I split the column to just get the postal code? the postal code start with 7 for example 76000 is the postal code in index 0
MedicalCenters["Postcode"][0]
Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
Example Data
Venue Venue Latitude Venue Longitude Venue Category Address
0 Lab. Corregidora 20.595621 -100.392677 Medical Center Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))
I tried using regex but I get and error
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
ERROR
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-185-84c21a29d484> in <module>
1 # get zipcode from full address
2 import re
----> 3 MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self, *args, **kwargs)
1950 )
1951 raise TypeError(msg)
-> 1952 return func(self, *args, **kwargs)
1953
1954 wrapper.__name__ = func_name
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in extract(self, pat, flags, expand)
3037 @forbid_nonstring_types(["bytes"])
3038 def extract(self, pat, flags=0, expand=True):
-> 3039 return str_extract(self, pat, flags=flags, expand=expand)
3040
3041 @copy(str_extractall)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in str_extract(arr, pat, flags, expand)
1010 return _str_extract_frame(arr._orig, pat, flags=flags)
1011 else:
-> 1012 result, name = _str_extract_noexpand(arr._parent, pat, flags=flags)
1013 return arr._wrap_result(result, name=name, expand=expand)
1014
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr, pat, flags)
871
872 regex = re.compile(pat, flags=flags)
--> 873 groups_or_na = _groups_or_na_fun(regex)
874
875 if regex.groups == 1:
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
835 """Used in both extract_noexpand and extract_frame"""
836 if regex.groups == 0:
--> 837 raise ValueError("pattern contains no capture groups")
838 empty_row = [np.nan] * regex.groups
839
ValueError: pattern contains no capture groups
time: 39.5 ms
Upvotes: 2
Views: 3221
Reputation: 149
Data of Address were an object thats why the regex was not working
MedicalCenters.dtypes
Venue object
Venue Latitude float64
Venue Longitude float64
Venue Category object
Health System object
geom object
Address object
Postcode object
dtype: object
time: 6.41 ms
after convert object to string :
MedicalCenters['Address'] = MedicalCenters['Address'].astype('str')
I was able to apply the regex modified thanks to glam
# get zipcode from full address
import re
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
Upvotes: 0
Reputation: 116
You need to add parentheses to get make it a group
MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
Upvotes: 4
Reputation: 809
You can try to split the string first, then it will be easier to match the postcode:
address = '75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0'
matches = list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', '))) # ['76000']
So you can populate your DataFrame by:
df['postcode'] = df['address'].apply(lambda address: list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', ')))[0])
Upvotes: 0