Reputation: 859

Split columns after first number with condition

I have a dataframe that holds addresses, which are split in multiple columns:

address        postalcode      city       province       country
-----------------------------------------------------------------
123 Fake St     F1A2K3       Fakeville      ON              CA

I want to split the address column into two separate columns, one for house number and one for street name. Therefore, after running it, the above df would look like:

house_no    street        postalcode      city       province       country
----------------------------------------------------------------------------
  123       Fake St        F1A2K3       Fakeville      ON              CA

I have been doing this by simply using df[['house_no', 'street']] = df['address'].str.split(' ', 1, expand=True), which was working fine until I noticed that some addresses under the address column are structured as Apt 316 555 Fake Drive (or Unit 316 555 Fake Drive). Therefore, when I run what I am currently using on those, I get:

house_no          street        postalcode      city       province       country
---------------------------------------------------------------------------------
  Apt       316 555 Fake Drive     F1A2K3       Fakeville      ON           CA

Obviously, this is not want I want.

So essentially, I need an algorithm that splits the string after the first number, unless it starts with "Unit" or "Apt", in which case it will take the second number it sees and split that out into the house_no column.

I need to do this without losing any information, therefore keeping the Unit/Apt number as well (that can be stored in the house_no column, but ideally would have its own unit_no column). Therefore, ideally, the output would look like:

unit_no    house_no   street    postalcode     city      province   country
---------------------------------------------------------------------------------
 Apt 316     555    Fake Drive    F1A2K3     Fakeville     ON         CA

Given that the original address column contained Apt 316 555 Fake Drive and is now split into unit_no,house_no, and street.

I am not sure where to start with this, so any help would be appreciated.

Upvotes: 0

Answers (4)

rooky

Reputation: 86

I am not sure I understood the question, but if you want to eliminate the words Apt or Unit this will do it (here df and df2 are two .xlsx files I made, and df2 is just another dataframe with the columns you need, (house_no and street) and with as many rows as df but with empty values):

import pandas as pd

df=pd.read_excel('raspuns_so.xlsx')

df2=pd.read_excel('sol.xlsx')
tmp=df['add'].str.split(' ', 1, expand=True)
for i, row_series in df2.iterrows():
    if tmp[0][i].isdigit():
        df2[['house_no', 'street']] = df['add'].str.split(' ', 1, expand=True)
    else:
        var=tmp[1][i].split(' ')
        arr=[var[0],var[1]]
        df2.at[i,'house_no'] = " ".join(arr)
        df2.at[i,'street'] = var[2]
print df2

My df:

              address      pc       city province country
0         123 Fake ST  F1A2K3  Fakeville       ON      CA
1  Apt 123 555 FakeST  300000  Fakeville      OFF     USA

My df2:

   house_no  street  pc  city  province  country
0         0       0   0     0         0        0
1         0       0   0     0         0        0

df2 after I ran the code:

  house_no   street  pc  city  province  country
0      123  Fake ST   0     0         0        0
1  123 555   FakeST   0     0         0        0

Upvotes: 0

Quang Hoang

Reputation: 150785

Let's try this data:

df = pd.DataFrame({'address':['123 Fake Street', 'Apt 316 555 Fake Drive']})

# df
#                   address
# 0         123 Fake Street
# 1  Apt 316 555 Fake Drive

Since you did not specify if you want to capture Unit\Apt number, I assume you do not:

df.address.str.extract('(?:Unit|Apt \d+ )?(?P<house_no>\d+) (?P<street>.*)$')

Output:

  house_no       street
0      123  Fake Street
1      555   Fake Drive

Only slight modification needed if you want to keep Unit/Apt:

df.address.str.extract('(?P<unit_no>Unit|Apt \d+ )?(?P<house_no>\d+) (?P<street>.*)$')

Output:

    unit_no house_no       street
0       NaN      123  Fake Street
1  Apt 316       555   Fake Drive

Upvotes: 1

AlbinoRhino

Reputation: 497

If you always have a number followed by a space then street name, you could use the str.split(' ') function on the data in address

For example, make a new column with streeet name, a new column with street number

create two arrays, one with street number by using for example number = address.split(' ')

number[0] will always be the street number

since some street names have spaces, append number[1:] together and that is your data for the street name column

sorry for the psuedo code, in a rush.

Upvotes: 0

SRT HellKitty

Reputation: 597

you can you the df.loc function, this should work.

df.loc[~df['address'].str.contains('Unit|Apt'), 'house_no'] = df['address'].str.split(' ')

Upvotes: 0

Split columns after first number with condition

Answers (4)

Related Questions