DudeWah
DudeWah

Reputation: 361

How to extract apartment from address in Pandas

I have a rather messy data set that has a lot of inconsistencies and errors due to manually input data.

I'm working on the address column of this dataset in pandas.

What I would like to do is break the address column into 3 separate entities:

1) a column for the address

2) a column for the street number

3) a column for the apartment or unit number

The data looks like the following:

address
----------------------
123 smith street #5234
5000 john ct
34 wood st apt# 23
523 fire road apt #87
charles way apt. 434
0987 misty lane unit B 

I have already removed the street numbers into their own column. For this I used "np.where" using an easy logical condition that if the string started with digits extract them into the new street column.

I am now stuck on how to do this with apartment numbers.

I am assuming that because of the inconsistencies, I have to do something like:

df['apt/unit'] = np.where(str contains "apt", extract string starting at "apt" until end, else np.NaN)
df['apt/unit'] = np.where(str contains "unit", extract string starting at "unit" until end, else np.NaN)

Will I have to use regex to do this? If so, what is the way to go about that?

Are there any alternatives to this line of thinking?

Upvotes: 1

Views: 2332

Answers (4)

scratchpad
scratchpad

Reputation: 286

If you want to use a regex, here's a way to do it that matches number, street, and apartment in separate named groups and then uses df.apply to populate new columns of the dataframe. Street name types are listed out brute force. Anything left over is the apartment number.

regex = '(?P<num>\d+)? ?(?P<street>\w+ (street|way|st|road|ct|lane)) ?(?P<apt>.+)?'

def get_num(address):
    match = re.match(regex, address)
    return match.group('num')


def get_street(address):
    match = re.match(regex, address)
    return match.group('street')


def get_apt(address):
    match = re.match(address)
    return match.group('apt')


df['num'] = df['address'].apply(get_num)
df['street'] = df['address'].apply(get_street)
df['apt'] = df['address'].apply(get_apt)

Upvotes: 1

Chris Adams
Chris Adams

Reputation: 18647

Using Series.str.extract, something like:

pat = r'^(?P<number>\d+)?(?P<street>.+(?=\bapt|\bunit)|.+(?=#)|.+)(?P<apt_unit>(?:\bapt|\bunit|#).+)?'
df.address.str.extract(pat)

[out]

  number          street  apt_unit
0    123   smith street      #5234
1   5000         john ct       NaN
2     34        wood st    apt# 23
3    523      fire road    apt #87
4    NaN    charles way   apt. 434
5   0987     misty lane     unit B

Upvotes: 2

Erfan
Erfan

Reputation: 42916

Since you have multiple conditions for your apt/unit column, you can use np.select here like the following:

# Define our conditions
conditions = [
    df.address.str.contains('apt'),
    df.address.str.contains('unit'),
    df.address.str.contains('#')
]

# Define our choices based on our conditions
choices = [
    df.address.apply(lambda x: x[x.find('apt'):]),
    df.address.apply(lambda x: x[x.find('unit'):]),
    df.address.apply(lambda x: x[x.find('#'):])
]

# Apply this logic by creating the new column and cleaning up address column
df['apt/unit'] = np.select(conditions, choices, default = '')

# Clean up our address column
choices2 = [
    df.address.apply(lambda x: x[:x.find('apt')]),
    df.address.apply(lambda x: x[:x.find('unit')]),
    df.address.apply(lambda x: x[:x.find('#')])
]
df['address'] = np.select(conditions, choices2, default = df.address)

Output

print(df)

             address  apt/unit
0  123 smith street      #5234
1       5000 john ct          
2        34 wood st    apt# 23
3     523 fire road    apt #87
4       charles way   apt. 434
5   0987 misty lane     unit B

Upvotes: 2

SerAlejo
SerAlejo

Reputation: 493

If you already have removed all street numbers the only numbers left should be apartment numbers. Therefore you can use a simple regex to extract all numbers that are left in this column:

df['apt_number'] = df.address.str.extract(r'([\d]+)')

edit: With the above answer alone the numbers will remain in the addresses column, to remove we simply replace them with an empty string:

df['address'] = df.address.str.replace(r'([\d]+)', '')

Upvotes: 0

Related Questions