Reputation: 8096
I have the following Pandas code, in which I am trying to replace the names of countries with the string <country>.
df['title_type2'] = df['title_type']
countries = open(r'countries.txt').read().splitlines() # Reads all lines into a list and removes \n.
countries = [country.replace(' ', r'\s') for country in countries]
pattern = r'\b' + '|'.join(countries) + r'\b'
df['title_type2'].str.replace(pattern, '<country>')
However, I can't get countries with spaces (like South Korea) to work correctly: they do not get replaced. The problem seems to be that my \s is turning into \\s. How can I avoid this, or how can I fix the issue?
Upvotes: 0
Views: 282
Reputation: 31011
There is no need to replace any space with \s.
Your pattern should instead consist of:
\b - the "starting" word boundary,
(?:...|...|...) - a non-capturing group with the country names as alternatives,
\b - the "ending" word boundary,
for example:
pattern = r'\b(?:China|South Korea|Taiwan)\b'
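To illustrate why the group matters, here is a minimal sketch using only the standard re module. The input string 'Chinatown' is a made-up example, not data from the question. Without the group, each \b binds only to the alternative next to it, so the first alternative is effectively \bChina with no trailing boundary.

```python
import re

# Without a group, the alternatives are `\bChina`, `South Korea` and `Taiwan\b`.
ungrouped = r'\bChina|South Korea|Taiwan\b'
# With a non-capturing group, both boundaries apply to every alternative.
grouped = r'\b(?:China|South Korea|Taiwan)\b'

text = 'Chinatown'  # hypothetical input, chosen to expose the difference

ungrouped_result = re.sub(ungrouped, '<country>', text)
grouped_result = re.sub(grouped, '<country>', text)

print(ungrouped_result)  # <country>town
print(grouped_result)    # Chinatown
```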
Then you can do the replacement:
df['title_type2'].str.replace(pattern, '<country>', regex=True)
I created test data as follows:
import pandas as pd
df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
                  columns=['title_type'])
df['title_type2'] = df['title_type']
and got:
0 Abc <country>
1 Xyz <country>
2 Zxx <country>
3 No country name
Name: title_type2, dtype: object
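If the names come from a list (as with countries.txt in the question), the same pattern can be assembled programmatically. The following is only a sketch under some assumptions: the countries list is a hard-coded stand-in for the file contents, re.escape is used so that any regex metacharacters in a name are treated literally, and regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import re
import pandas as pd

# Hypothetical stand-in for the lines read from countries.txt.
countries = ['China', 'South Korea', 'Taiwan']

# Escape each name so characters such as '.' or '(' are matched literally,
# then wrap the alternatives in a non-capturing group between word boundaries.
alternatives = '|'.join(re.escape(name) for name in countries)
pattern = r'\b(?:' + alternatives + r')\b'

df = pd.DataFrame(
    {'title_type': ['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name']}
)

# str.replace returns a new Series, so assign the result to keep it.
df['title_type2'] = df['title_type'].str.replace(pattern, '<country>', regex=True)

print(df['title_type2'])
```

Note that the result is assigned back to the column here; calling str.replace on its own does not modify the DataFrame.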
Upvotes: 1