Reputation: 3580
Want to extract city name from the address which appear after zip code from pandas dataframe.
Given:
10 rue des Treuils BP 12 33023, Bordeaux France
I want to extract Bordeaux
from column of dataframe.
City name is always first after the comma , but it is not guaranteed to be one word. Need to strip off country name which will be a fixed string like France , Italy etc.
More examples of french city names
Les Deux Alpes
Val dIsere
Upvotes: 2
Views: 2020
Reputation: 98861
United States will be fixed string which can be stripped off as on exact match
My solution is to remove the country name, which will leave us with the city name only.
This approach seems to be easier since country names are fixed and can be easily removed based on a list
, i.e.:
split()
the address in two based on comma (,
);replace()
the country name with nothing
;apply()
to apply get_city()
function containing the steps above.tolist()
to convert column City
to a list. This last step is optional, as it depends on what you'll do with the city names.i.e.:
import pandas as pd
addresses = [['10 rue des Treuils BP 12 33023, Bordeaux France'],['Rua da Alegria 22, Lisboa Portugal'],['22 Some Street, NYC United States']]
df = pd.DataFrame(addresses,columns=['Address'])
countries = ['Portugal', 'France', 'United States']
def get_city(address):
city_country = address.split(",")[1]
for i in countries: city = city_country.replace(i, "")
return city.strip()
df['City'] = df['Address'].apply(get_city)
print (df['City'].tolist())
Output:
['Bordeaux', 'Lisboa', 'NYC']
PS:
You may want to lower()
both the addresses and countries list in order to avoid case SenSitIve mismatches.
Upvotes: 2
Reputation: 18906
Yeah maybe some advanced regex could handle this but the pandas naive approach would be:
import pandas as pd
import numpy as np
col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
'10 rue des Treuils BP 12 33023, Les Deux Alpes France',
'10 rue des Treuils BP 12 33023, New York United States'])
cities = np.where(col.str.endswith('United States'),
col.str.split(', ').str[1].str.split().str[:-2].str.join(' '),
col.str.split(', ').str[1].str.split().str[:-1].str.join(' '))
print(cities)
#['Bordeaux' 'Les Deux Alpes' 'New York']
A more general but not as effective solution (but who needs speed right?)
import pandas as pd
col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
'10 rue des Treuils BP 12 33023, New York United States',
'10 rue des Treuils BP 12 33023, Seoul South Korea',
'10 rue des Treuils BP 12 33023, Brazzaville Republic of Congo'])
countries = {'United States': 2 , 'South Korea': 2, 'Republic of Congo': 3}
n = [next((countries[k] for k,v in countries.items() if i.endswith(k)), 1) for i in col]
cities = [' '.join(i.split(', ')[1].split()[:-y]) for i,y in zip(col,n)]
print(cities)
# ['Bordeaux', 'Les Deux Alpes', 'New York', 'Seoul', 'Brazzaville']
And then simply assign back with:
df['city'] = cities
Upvotes: 0
Reputation: 568
If we consider your regex to be working with French addresses (ending by France), then you can use this :
/,\s([A-Z][A-Za-z\s-]+)\sFrance/gm
Link to the online regex simulator where I tested the expression
You mentioned earlier about the United States, but actually the way adresses are written is totally different, so you'll have to make another regex for it, I guess. (i.e: 4 Cross Lane Schererville, IN 46375)
Upvotes: 0