Omega
Omega

Reputation: 871

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.

Input data:

    Address Text
0   11 North Warren Circle Lisbon Falls ME 04252
1   227 Cony Street Augusta ME 04330
2   70 Buckner Drive Battle Creek MI
3   718 Perry Street Big Rapids MI
4   14857 Martinsville Road Van Buren MI
5   823 Woodlawn Ave Dallas TX 75208
6   2525 Washington Avenue Waco TX 76710
7   123 South Main St Dallas TX 75201

The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)

    City          State    Postcode 
0   Lisbon Falls  ME       04252
1   Augusta       ME       04330

My code:

# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)

# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]

# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df

Upvotes: 2

Views: 99

Answers (1)

RaidasGrisk
RaidasGrisk

Reputation: 444

Here's a way to do that:

import pandas as pd

# data
df = pd.DataFrame(
    ['11 North Warren Circle Lisbon Falls ME 04252',
     '227 Cony Street Augusta ME 04330',
     '70 Buckner Drive Battle Creek MI',
     '718 Perry Street Big Rapids MI',
     '14857 Martinsville Road Van Buren MI',
     '823 Woodlawn Ave Dallas TX 75208',
     '2525 Washington Avenue Waco TX 76710',
     '123 South Main St Dallas TX 75201'],
    columns=['Address Text'])

# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)

# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]


def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city


df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)

print(df[['City', 'State', 'Zip']])

"""
             City State    Zip
0   Lisbon Falls     ME  04252
1        Augusta     ME  04330
2   Battle Creek     MI    NaN
3     Big Rapids     MI    NaN
4      Van Buren     MI  14857
5         Dallas     TX  75208
6       nue Waco     TX  76710
7         Dallas     TX  75201
"""

Upvotes: 1

Related Questions