Reputation: 37
I have scraped some websites to gather company data, including addresses. Because of how the HTML is structured, I could only scrape each address as a whole from a single tag. An example of the output can be seen below.
Streetname housenumber zip-code city country
Street 1 1234 AB Amsterdam Netherlands
Longerstreetname 22 9876 XY Den Haag Netherlands
Name: Address, Length: 314, dtype: object
Now I need to extract the ZIP code (and only the ZIP code) into a new column for further analysis: I need to find out in which province each company is located. I mostly use pandas in my data cleaning phase.
I have searched for a method to extract the zip code, but I did not succeed. Any help would be very much appreciated!
Upvotes: 0
Views: 4704
Reputation: 210
I think you can use regex.
Example:
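import re
address = '7802 Grant Avenue Egg Harbor Township, NJ 08234'
# US ZIP code: five digits, optionally followed by a hyphen and four more digits (ZIP+4)
us_zip = r'\b(\d{5}(?:-\d{4})?)\b'
zip_code = re.search(us_zip, address)
# re.search returns None when nothing matches, so check before calling group()
if zip_code:
    print(zip_code.group(1))  # 08234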
Important note: There is no single zip code pattern that works worldwide. If you scrape companies from different countries, you need a separate regex for each of them.
Hope this file could help you. zip codes regex
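For the Dutch addresses in the question, the same idea can be applied to the whole column with pandas. This is only a sketch: it assumes the column is named Address, and that a postal code is always four digits followed by an optional space and two capital letters. The new column name Zip Code is just an illustrative choice.

```python
import pandas as pd

# Sample rows taken from the question; the column name 'Address' is assumed
df = pd.DataFrame({
    'Address': [
        'Street 1 1234 AB Amsterdam Netherlands',
        'Longerstreetname 22 9876 XY Den Haag Netherlands',
    ]
})

# Dutch postal code: four digits, an optional space, two capital letters
nl_zip = r'\b(\d{4}\s?[A-Z]{2})\b'

# str.extract returns the first match per row, or NaN when nothing matches
df['Zip Code'] = df['Address'].str.extract(nl_zip, expand=False)

print(df['Zip Code'].tolist())
```

Because the pattern looks for the postal code itself, it does not depend on how many words the street or city name has.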
Upvotes: 3
Reputation: 882
If the sample output posted in the question shows the values of a column named Address (dtype object) in a dataframe, then a new column with the extracted zip codes can be created as follows:
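# Split each address on spaces and join the third and fourth tokens (only works for one-word street names)
df['Zip Code'] = df['Address'].str.split(' ').str[2:4].str.join(' ')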
Upvotes: 0