Reputation: 191
I have a pandas dataframe with a column full of addresses. In general, I have inconsistent formatting and want to make the formatting consistent.
as an example:
df = pd.DataFrame({'apt': [1, 2, 3], 'unit': ['f', 'g', 'a'], 'address:
['15 E 59th St', '4 4th St', '4 83rd street']})
I am currently formatting my column as so:
df['address'] = df['address'].str.title().str.replace(r"(\d+)
(st|nd|rd|th)\b", r"\1")
to remove "ordinality" from the street numbers. However, sometimes the address ends in an abbreviation 'st' and sometimes the full word 'street' and I think simply removing the last word of each string would be the easiest way to get around this. How can I do this in a Pythonic way so that I am applying something to the entire column that removes the last word? I've tried rsplit and it won't work the way I have it:
df['address'].str.title().str.replace(r"(\d+)(st|nd|rd|th)\b",
r"\1").str.rsplit([' ', 1])[0]
Edit: the current answer is very helpful, but I have realized that dropping the street type actually won't work in a few cases where a street and avenue have the same beginning (1 st and 1 ave). Is there a way to make the street endings consistent so that all versions of all street endings are the same?
Upvotes: 2
Views: 4034
Reputation: 323236
Doing with str.rsplit
df.address.str.rsplit(' ',1).str[0]
0 15 E 59th
1 4 4th
2 4 83rd
Name: address, dtype: object
df.address=df.address.str.rsplit(' ',1).str[0]
Upvotes: 10