Reputation: 859
I have a dataframe that holds addresses, which are split in multiple columns:
address postalcode city province country
-----------------------------------------------------------------
123 Fake St F1A2K3 Fakeville ON CA
I want to split the address column into two separate columns, one for house number and one for street name. Therefore, after running it, the above df would look like:
house_no street postalcode city province country
----------------------------------------------------------------------------
123 Fake St F1A2K3 Fakeville ON CA
I have been doing this by simply using df[['house_no', 'street']] = df['address'].str.split(' ', 1, expand=True)
, which was working fine until I noticed that some addresses under the address
column are structured as Apt 316 555 Fake Drive
(or Unit 316 555 Fake Drive
). Therefore, when I run what I am currently using on those, I get:
house_no street postalcode city province country
---------------------------------------------------------------------------------
Apt 316 555 Fake Drive F1A2K3 Fakeville ON CA
Obviously, this is not want I want.
So essentially, I need an algorithm that splits the string after the first number, unless it starts with "Unit" or "Apt", in which case it will take the second number it sees and split that out into the house_no
column.
I need to do this without losing any information, therefore keeping the Unit/Apt number as well (that can be stored in the house_no
column, but ideally would have its own unit_no
column). Therefore, ideally, the output would look like:
unit_no house_no street postalcode city province country
---------------------------------------------------------------------------------
Apt 316 555 Fake Drive F1A2K3 Fakeville ON CA
Given that the original address
column contained Apt 316 555 Fake Drive
and is now split into unit_no
,house_no
, and street
.
I am not sure where to start with this, so any help would be appreciated.
Upvotes: 0
Views: 127
Reputation: 86
I am not sure I understood the question, but if you want to eliminate the words Apt or Unit this will do it (here df and df2 are two .xlsx files I made, and df2 is just another dataframe with the columns you need, (house_no and street) and with as many rows as df but with empty values):
import pandas as pd
df=pd.read_excel('raspuns_so.xlsx')
df2=pd.read_excel('sol.xlsx')
tmp=df['add'].str.split(' ', 1, expand=True)
for i, row_series in df2.iterrows():
if tmp[0][i].isdigit():
df2[['house_no', 'street']] = df['add'].str.split(' ', 1, expand=True)
else:
var=tmp[1][i].split(' ')
arr=[var[0],var[1]]
df2.at[i,'house_no'] = " ".join(arr)
df2.at[i,'street'] = var[2]
print df2
My df:
address pc city province country
0 123 Fake ST F1A2K3 Fakeville ON CA
1 Apt 123 555 FakeST 300000 Fakeville OFF USA
My df2:
house_no street pc city province country
0 0 0 0 0 0 0
1 0 0 0 0 0 0
df2 after I ran the code:
house_no street pc city province country
0 123 Fake ST 0 0 0 0
1 123 555 FakeST 0 0 0 0
Upvotes: 0
Reputation: 150785
Let's try this data:
df = pd.DataFrame({'address':['123 Fake Street', 'Apt 316 555 Fake Drive']})
# df
# address
# 0 123 Fake Street
# 1 Apt 316 555 Fake Drive
Since you did not specify if you want to capture Unit\Apt
number, I assume you do not:
df.address.str.extract('(?:Unit|Apt \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
house_no street
0 123 Fake Street
1 555 Fake Drive
Only slight modification needed if you want to keep Unit/Apt
:
df.address.str.extract('(?P<unit_no>Unit|Apt \d+ )?(?P<house_no>\d+) (?P<street>.*)$')
Output:
unit_no house_no street
0 NaN 123 Fake Street
1 Apt 316 555 Fake Drive
Upvotes: 1
Reputation: 497
If you always have a number followed by a space then street name, you could use the str.split(' ') function on the data in address
For example, make a new column with streeet name, a new column with street number
create two arrays, one with street number by using for example number = address.split(' ')
number[0] will always be the street number
since some street names have spaces, append number[1:] together and that is your data for the street name column
sorry for the psuedo code, in a rush.
Upvotes: 0
Reputation: 597
you can you the df.loc
function, this should work.
df.loc[~df['address'].str.contains('Unit|Apt'), 'house_no'] = df['address'].str.split(' ')
Upvotes: 0