Reputation: 361
I have a rather messy data set that has a lot of inconsistencies and errors due to manually input data.
I'm working on the address column of this dataset in pandas.
What I would like to do is break the address column into 3 separate entities:
1) a column for the address
2) a column for the street number
3) a column for the apartment or unit number
The data looks like the following:
address
----------------------
123 smith street #5234
5000 john ct
34 wood st apt# 23
523 fire road apt #87
charles way apt. 434
0987 misty lane unit B
I have already removed the street numbers into their own column. For this I used "np.where" using an easy logical condition that if the string started with digits extract them into the new street column.
I am now stuck on how to do this with apartment numbers.
I am assuming that because of the inconsistencies, I have to do something like:
df['apt/unit'] = np.where(str contains "apt", extract string starting at "apt" until end, else np.NaN)
df['apt/unit'] = np.where(str contains "unit", extract string starting at "unit" until end, else np.NaN)
Will I have to use regex to do this? If so, what is the way to go about that?
Are there any alternatives to this line of thinking?
Upvotes: 1
Views: 2332
Reputation: 286
If you want to use a regex, here's a way to do it that matches number, street, and apartment in separate named groups and then uses df.apply to populate new columns of the dataframe. Street name types are listed out brute force. Anything left over is the apartment number.
regex = '(?P<num>\d+)? ?(?P<street>\w+ (street|way|st|road|ct|lane)) ?(?P<apt>.+)?'
def get_num(address):
match = re.match(regex, address)
return match.group('num')
def get_street(address):
match = re.match(regex, address)
return match.group('street')
def get_apt(address):
match = re.match(address)
return match.group('apt')
df['num'] = df['address'].apply(get_num)
df['street'] = df['address'].apply(get_street)
df['apt'] = df['address'].apply(get_apt)
Upvotes: 1
Reputation: 18647
Using Series.str.extract
, something like:
pat = r'^(?P<number>\d+)?(?P<street>.+(?=\bapt|\bunit)|.+(?=#)|.+)(?P<apt_unit>(?:\bapt|\bunit|#).+)?'
df.address.str.extract(pat)
[out]
number street apt_unit
0 123 smith street #5234
1 5000 john ct NaN
2 34 wood st apt# 23
3 523 fire road apt #87
4 NaN charles way apt. 434
5 0987 misty lane unit B
Upvotes: 2
Reputation: 42916
Since you have multiple conditions for your apt/unit
column, you can use np.select
here like the following:
# Define our conditions
conditions = [
df.address.str.contains('apt'),
df.address.str.contains('unit'),
df.address.str.contains('#')
]
# Define our choices based on our conditions
choices = [
df.address.apply(lambda x: x[x.find('apt'):]),
df.address.apply(lambda x: x[x.find('unit'):]),
df.address.apply(lambda x: x[x.find('#'):])
]
# Apply this logic by creating the new column and cleaning up address column
df['apt/unit'] = np.select(conditions, choices, default = '')
# Clean up our address column
choices2 = [
df.address.apply(lambda x: x[:x.find('apt')]),
df.address.apply(lambda x: x[:x.find('unit')]),
df.address.apply(lambda x: x[:x.find('#')])
]
df['address'] = np.select(conditions, choices2, default = df.address)
Output
print(df)
address apt/unit
0 123 smith street #5234
1 5000 john ct
2 34 wood st apt# 23
3 523 fire road apt #87
4 charles way apt. 434
5 0987 misty lane unit B
Upvotes: 2
Reputation: 493
If you already have removed all street numbers the only numbers left should be apartment numbers. Therefore you can use a simple regex to extract all numbers that are left in this column:
df['apt_number'] = df.address.str.extract(r'([\d]+)')
edit: With the above answer alone the numbers will remain in the addresses column, to remove we simply replace them with an empty string:
df['address'] = df.address.str.replace(r'([\d]+)', '')
Upvotes: 0