chris_tri
chris_tri

Reputation: 67

Replace string pattern in a data frame

suppose I have a data frame:

df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

How can I replace all cells which include the pattern Yemen with the string Yemen? The result should be:

df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Yemen', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

In the next step, is it possible to replace all cells including Andorra with Andorra and all cells including Yemen with Yemen in one step, using lists or dictionaries?

The result should be:

df = pd.DataFrame({'Country': ['Aruba', 'Andorra', 'Afgahnistan', 'Yemen', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

I tried for example

df.replace(regex='lorem Andorra ipsum', value='Andorra ')

which works for sure, since it looks for lorem Andorra ipsum. But that is a too specific approach. I further tried different reg expressions as

df.replace(regex=r'^Andorra.$', value='Andorra'). But it didn't work.

I appreciate any help in advance!

Upvotes: 4

Views: 69

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626689

You may use this approach if the number of words is not too huge:

import pandas as pd
df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
l = ['Yemen', 'Andorra']
>>> df['Country'].str.replace(fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.*", r"\1")
0           Aruba
1         Andorra
2     Afgahnistan
3           Yemen
4    South Africa
Name: Country, dtype: object

The fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.* code creates a regex using the l list strings as alternation in the capturing group (with ID 1) and looks like

(?s).*?\b(Andorra|Yemen)\b.*

See the regex demo. Note the longer country names will be put first, so as to match the longest country name from country names sharing the same prefix.

Details:

  • (?s) - a re.DOTALL inline modifier, makes . match line break chars, too
  • .*? - any 0 or more chars, as few as possible
  • \b - word boundary
  • (Andorra|Yemen) - either Andorra or Yemen string
  • \b - word boundary
  • .* - the rest of the string.

Upvotes: 1

al_qaysee
al_qaysee

Reputation: 81

first of all, put all patterns in a single array then loop this array, on top of another loop that is looping in the df, to search for the pattern and replace.

Here is a code:

patterns = ["Yemen","Andorra"]
for pattern in patterns: 
  for index, row in df.iterrows(): 
    if row[0].lower().find(pattern.lower()) != -1 :  
        df['Country'][index] = pattern
        break
print(df)

And the reult:

        Country             Geographic region
0         Aruba   Latin America and Caribbean
1       Andorra       Europe and Central Asia
2   Afgahnistan                    South Asia
3         Yemen  Middle East and North Africa
4  South Africa            Sub Saharan Africa

Upvotes: 3

Related Questions