Get postal code from full address column in dataframe by regex str.extract() and add as new column in pandas

I have a dataframe with full addresses in a column, and I need to create a separate column with just the postal code of 5 digits starting by 7 in the same dataframe. Some of the addresses may be empty or postal code not found.

How do I split the column to just get the postal code? the postal code start with 7 for example 76000 is the postal code in index 0

MedicalCenters["Postcode"][0]
Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))

Example Data

    Venue         Venue Latitude Venue Longitude Venue Category Address
0 Lab. Corregidora  20.595621   -100.392677      Medical Center Location(75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0))

I tried using regex but I get and error

# get zipcode from full address
import re 
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False) 

ERROR

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-185-84c21a29d484> in <module>
      1 # get zipcode from full address
      2 import re
----> 3 MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b', expand=False)

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self, *args, **kwargs)
   1950                 )
   1951                 raise TypeError(msg)
-> 1952             return func(self, *args, **kwargs)
   1953 
   1954         wrapper.__name__ = func_name

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in extract(self, pat, flags, expand)
   3037     @forbid_nonstring_types(["bytes"])
   3038     def extract(self, pat, flags=0, expand=True):
-> 3039         return str_extract(self, pat, flags=flags, expand=expand)
   3040 
   3041     @copy(str_extractall)

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in str_extract(arr, pat, flags, expand)
   1010         return _str_extract_frame(arr._orig, pat, flags=flags)
   1011     else:
-> 1012         result, name = _str_extract_noexpand(arr._parent, pat, flags=flags)
   1013         return arr._wrap_result(result, name=name, expand=expand)
   1014 

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr, pat, flags)
    871 
    872     regex = re.compile(pat, flags=flags)
--> 873     groups_or_na = _groups_or_na_fun(regex)
    874 
    875     if regex.groups == 1:

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
    835     """Used in both extract_noexpand and extract_frame"""
    836     if regex.groups == 0:
--> 837         raise ValueError("pattern contains no capture groups")
    838     empty_row = [np.nan] * regex.groups
    839 

ValueError: pattern contains no capture groups

time: 39.5 ms

Upvotes: 2

Views: 3221

Answers (3)

Data of Address were an object thats why the regex was not working

MedicalCenters.dtypes
Venue               object
Venue Latitude     float64
Venue Longitude    float64
Venue Category      object
Health System       object
geom                object
Address             object
Postcode            object
dtype: object
time: 6.41 ms

after convert object to string :

MedicalCenters['Address'] = MedicalCenters['Address'].astype('str') 

I was able to apply the regex modified thanks to glam

# get zipcode from full address
import re 
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")

OUTPUT

Upvotes: 0

glam
glam

Reputation: 116

You need to add parentheses to get make it a group

MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")

Upvotes: 4

Kassian Sun
Kassian Sun

Reputation: 809

You can try to split the string first, then it will be easier to match the postcode:

address = '75, Avenida Corregidora, Centro, Delegación Centro Histórico, Santiago de Querétaro, Municipio de Querétaro, Querétaro, 76000, México, (20.5955795, -100.39274225, 0.0'

matches = list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', '))) # ['76000']

So you can populate your DataFrame by:

df['postcode'] = df['address'].apply(lambda address: list(filter(lambda x: x.startswith('7') and len(x) == 5, address.split(', ')))[0])

Upvotes: 0

Related Questions