Reputation: 67
suppose I have a data frame:
df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
How can I replace all cells which include the pattern Yemen with the string Yemen? The result should be:
df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Yemen', 'South Africa'],
'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
In the next step, is it possible to replace all cells including Andorra with Andorra and all cells including Yemen with Yemen in one step, using lists or dictionaries?
The result should be:
df = pd.DataFrame({'Country': ['Aruba', 'Andorra', 'Afgahnistan', 'Yemen', 'South Africa'],
'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
I tried for example
df.replace(regex='lorem Andorra ipsum', value='Andorra ')
which works for sure, since it looks for lorem Andorra ipsum. But that is a too specific approach. I further tried different reg expressions as
df.replace(regex=r'^Andorra.$', value='Andorra'). But it didn't work.
I appreciate any help in advance!
Upvotes: 4
Views: 69
Reputation: 626689
You may use this approach if the number of words is not too huge:
import pandas as pd
df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
l = ['Yemen', 'Andorra']
>>> df['Country'].str.replace(fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.*", r"\1")
0 Aruba
1 Andorra
2 Afgahnistan
3 Yemen
4 South Africa
Name: Country, dtype: object
The fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.*
code creates a regex using the l
list strings as alternation in the capturing group (with ID 1) and looks like
(?s).*?\b(Andorra|Yemen)\b.*
See the regex demo. Note the longer country names will be put first, so as to match the longest country name from country names sharing the same prefix.
Details:
(?s)
- a re.DOTALL
inline modifier, makes .
match line break chars, too.*?
- any 0 or more chars, as few as possible\b
- word boundary(Andorra|Yemen)
- either Andorra
or Yemen
string\b
- word boundary.*
- the rest of the string.Upvotes: 1
Reputation: 81
first of all, put all patterns in a single array then loop this array, on top of another loop that is looping in the df, to search for the pattern and replace.
Here is a code:
patterns = ["Yemen","Andorra"]
for pattern in patterns:
for index, row in df.iterrows():
if row[0].lower().find(pattern.lower()) != -1 :
df['Country'][index] = pattern
break
print(df)
And the reult:
Country Geographic region
0 Aruba Latin America and Caribbean
1 Andorra Europe and Central Asia
2 Afgahnistan South Asia
3 Yemen Middle East and North Africa
4 South Africa Sub Saharan Africa
Upvotes: 3