Replace string pattern in a data frame

Question

suppose I have a data frame:

df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

How can I replace all cells which include the pattern Yemen with the string Yemen? The result should be:

df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Yemen', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

In the next step, is it possible to replace all cells including Andorra with Andorra and all cells including Yemen with Yemen in one step, using lists or dictionaries?

The result should be:

df = pd.DataFrame({'Country': ['Aruba', 'Andorra', 'Afgahnistan', 'Yemen', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

I tried for example

df.replace(regex='lorem Andorra ipsum', value='Andorra ')

which works for sure, since it looks for lorem Andorra ipsum. But that is a too specific approach. I further tried different reg expressions as

df.replace(regex=r'^Andorra.$', value='Andorra'). But it didn't work.

I appreciate any help in advance!

Wiktor Stribiżew · Accepted Answer

You may use this approach if the number of words is not too huge:

import pandas as pd
df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
l = ['Yemen', 'Andorra']
>>> df['Country'].str.replace(fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.*", r"\1")
0           Aruba
1         Andorra
2     Afgahnistan
3           Yemen
4    South Africa
Name: Country, dtype: object

The fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.* code creates a regex using the l list strings as alternation in the capturing group (with ID 1) and looks like

(?s).*?\b(Andorra|Yemen)\b.*

See the regex demo. Note the longer country names will be put first, so as to match the longest country name from country names sharing the same prefix.

Details:

(?s) - a re.DOTALL inline modifier, makes . match line break chars, too
.*? - any 0 or more chars, as few as possible
\b - word boundary
(Andorra|Yemen) - either Andorra or Yemen string
\b - word boundary
.* - the rest of the string.

Replace string pattern in a data frame

Answers (2)

Related Questions