Pandas regex replace with multiple values and spaces in the values

Question

I have the following Pandas code, in which I am trying to replace the names of countries with an empty string.

df['title_type2'] = df['title_type']
countries = open(r'countries.txt').read().splitlines()    # Reads all lines into a list and strips the line endings
countries = [country.replace(' ', r'\s') for country in countries]
pattern = r'\b' + '|'.join(countries) + r'\b'
df['title_type2'].str.replace(pattern, '')

However, I can't get countries with spaces (like South Korea) to work correctly: they do not get replaced. The problem seems to be that my \s is turning into \\s. How can I avoid this, or how else can I fix the issue?

Valdi_Bo · Accepted Answer

There is no need to replace any space with \s.
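
A quick check with the standard-library re module illustrates this point. It is only a sketch, and the sample string is made up for illustration. A literal space in a pattern matches a space in the text; the one exception is a pattern compiled with re.VERBOSE, which ignores unescaped whitespace.

```python
import re

# Hypothetical sample text, not taken from the asker's data.
text = 'Zxx South Korea'

# A plain space inside the pattern matches a literal space in the text.
plain = re.sub(r'\bSouth Korea\b', '', text)

# Under re.VERBOSE, unescaped whitespace in the pattern is ignored,
# so the same pattern is read as 'SouthKorea' and no longer matches.
verbose = re.sub(r'\bSouth Korea\b', '', text, flags=re.VERBOSE)

print(repr(plain))
print(repr(verbose))
```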

Your pattern should rather include:

\b - a "starting" word boundary,
(?:...|...|...) - a non-capturing group with the country names as alternatives,
\b - an "ending" word boundary,

something like:

pattern = r'\b(?:China|South Korea|Taiwan)\b'
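
The hard-coded pattern above could also be assembled from a list, roughly as sketched here. The helper name build_pattern and the sample names are illustrative assumptions. re.escape is added so that names containing regex metacharacters are matched literally, and sorting longest-first is a precaution so that a longer name is tried before a shorter one that is its prefix.

```python
import re

def build_pattern(names):
    """Return a regex matching any of `names` as whole words (illustrative helper)."""
    # Longest first, so 'Guinea-Bissau' is tried before 'Guinea'.
    ordered = sorted(names, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '(' literal.
    alternatives = '|'.join(re.escape(name) for name in ordered)
    # The non-capturing group makes both \b anchors apply to every alternative.
    return r'\b(?:' + alternatives + r')\b'

countries = ['China', 'South Korea', 'Taiwan']  # stand-in for countries.txt
grouped = build_pattern(countries)
ungrouped = r'\b' + '|'.join(countries) + r'\b'

# Without the group, \b binds only to the first and last alternative,
# so 'China' is also removed from inside 'Chinatown'.
print(re.sub(ungrouped, '', 'Chinatown'))
print(re.sub(grouped, '', 'Chinatown'))
print(repr(re.sub(grouped, '', 'Zxx South Korea')))
```

The comparison with the ungrouped pattern shows why the (?:...) group matters: it is what ties the word boundaries to each country name rather than only to the outermost ones.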

Then you can do the replacement:

# regex=True is required on pandas 2.0+, where the default is a literal match
df['title_type2'].str.replace(pattern, '', regex=True)

I created test data as follows:

import pandas as pd

df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
    columns=['title_type'])
df['title_type2'] = df['title_type']

and got:

0      Abc 
1      Xyz 
2      Zxx 
3    No country name
Name: title_type2, dtype: object
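
Putting the pieces together, a self-contained sketch of this test might look as follows. It assumes a recent pandas version, where Series.str.replace treats the pattern as a literal string unless regex=True is passed. The final str.strip() and the column name title_type2_clean are optional extras added for illustration; they remove the trailing space left behind by the replacement.

```python
import pandas as pd

df = pd.DataFrame(
    ['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
    columns=['title_type'],
)

pattern = r'\b(?:China|South Korea|Taiwan)\b'

# regex=True is needed on pandas 2.0+, where the default is a literal match.
# str.replace returns a new Series, so the result is assigned back.
df['title_type2'] = df['title_type'].str.replace(pattern, '', regex=True)

# Optional cleanup (an addition for illustration): drop leftover whitespace.
df['title_type2_clean'] = df['title_type2'].str.strip()

print(df)
```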
