user1411335
user1411335

Reputation: 3580

How to extract string from dataframe after regex matching

Want to extract city name from the address which appear after zip code from pandas dataframe. Given: 10 rue des Treuils BP 12 33023, Bordeaux France I want to extract Bordeaux from column of dataframe.

City name is always first after the comma , but it is not guaranteed to be one word. Need to strip off country name which will be a fixed string like France , Italy etc.

More examples of french city names

Upvotes: 2

Views: 2020

Answers (3)

Pedro Lobito
Pedro Lobito

Reputation: 98861

United States will be fixed string which can be stripped off as on exact match


My solution is to remove the country name, which will leave us with the city name only.
This approach seems to be easier since country names are fixed and can be easily removed based on a list, i.e.:

  1. split() the address in two based on comma (,);
  2. replace() the country name with nothing;
  3. Use panda's apply() to apply get_city() function containing the steps above.
  4. Use panda's tolist() to convert column City to a list. This last step is optional, as it depends on what you'll do with the city names.

i.e.:

import pandas as pd
addresses = [['10 rue des Treuils BP 12 33023, Bordeaux France'],['Rua da Alegria 22, Lisboa Portugal'],['22 Some Street, NYC United States']]
df = pd.DataFrame(addresses,columns=['Address'])

countries = ['Portugal', 'France', 'United States']

def get_city(address):
    city_country = address.split(",")[1]
    for i in countries: city = city_country.replace(i, "")
    return city.strip()

df['City'] = df['Address'].apply(get_city)
print (df['City'].tolist())

Output:

['Bordeaux', 'Lisboa', 'NYC']

PS: You may want to lower() both the addresses and countries list in order to avoid case SenSitIve mismatches.

Upvotes: 2

Anton vBR
Anton vBR

Reputation: 18906

Yeah maybe some advanced regex could handle this but the pandas naive approach would be:

import pandas as pd
import numpy as np

col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
                 '10 rue des Treuils BP 12 33023, Les Deux Alpes France',
                 '10 rue des Treuils BP 12 33023, New York United States'])

cities = np.where(col.str.endswith('United States'), 
                  col.str.split(', ').str[1].str.split().str[:-2].str.join(' '), 
                  col.str.split(', ').str[1].str.split().str[:-1].str.join(' '))

print(cities)
#['Bordeaux' 'Les Deux Alpes' 'New York']

A more general but not as effective solution (but who needs speed right?)

import pandas as pd

col = pd.Series(['10 rue des Treuils BP 12 33023, Bordeaux France',
                 '10 rue des Treuils BP 12 33023, New York United States',
                 '10 rue des Treuils BP 12 33023, Seoul South Korea',
                 '10 rue des Treuils BP 12 33023, Brazzaville Republic of Congo'])

countries = {'United States': 2 , 'South Korea': 2, 'Republic of Congo': 3}
n = [next((countries[k] for k,v in countries.items() if i.endswith(k)), 1) for i in col]
cities = [' '.join(i.split(', ')[1].split()[:-y]) for i,y in zip(col,n)]

print(cities)
# ['Bordeaux', 'Les Deux Alpes', 'New York', 'Seoul', 'Brazzaville']

And then simply assign back with:

df['city'] = cities

Upvotes: 0

Ryo Shiina
Ryo Shiina

Reputation: 568

If we consider your regex to be working with French addresses (ending by France), then you can use this :

/,\s([A-Z][A-Za-z\s-]+)\sFrance/gm

enter image description here

Link to the online regex simulator where I tested the expression

You mentioned earlier about the United States, but actually the way adresses are written is totally different, so you'll have to make another regex for it, I guess. (i.e: 4 Cross Lane Schererville, IN 46375)

Upvotes: 0

Related Questions