Filippo Sebastio
Filippo Sebastio

Reputation: 1112

If a column contains a word, then generate a new column with the value found

I am trying to clean my df['Country'] variable by creating a new one df['Country Clean'] that takes the value of the country variable if it finds it in the df['Country'] column.

I figured out though that if I repeat my command, I will also delete my previous findings and i will only get a variable reporting the finding for 'Russia'

Is there a way to do this?

data = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy 1', 'Italie', 'Ecco', 'Russia is in Euroasia' ,  'Yugoslavia', 'Russia']}
df = pd.DataFrame(data) 
df['Country Clean'] = df['Country'].str.replace(r'(^.*Italy.*$)', 'Italy')
df['Country Clean']  = df['Country'].str.replace(r'(^.*Russia.*$)', 'Russia')

Expected output

data2 = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy', 'Italy', NaN, 'Russia' , NaN , 'Russia']}
exp = pd.DataFrame(data2) 
exp

Upvotes: 1

Views: 513

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627083

I suggest first normalizing the country names and then change the Country Clean column values according to the allowed country list:

normalize_countries={"Italie": "Italy", "Rusia": "Russia"}    # Spelling corrections
pattern = r"\b(?:{})\b".format("|".join(normalize_countries)) # Regex to find misspellings

countries = ["Italy", "Russia"]                               # Country list
df['Country Clean'] = df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()])
def applyFunc(s):  
    for e in countries:
        if e in s:
            return e
    return 'NaN'

df['Country Clean'] = df['Country Clean'].apply(applyFunc)

Output:

>>> df
  Number                Country Country Clean
0      1                Italy 1         Italy
1      2                 Italie         Italy
2      1                   Ecco           NaN
3      2  Russia is in Euroasia        Russia
4      1             Yugoslavia           NaN
5      2                 Russia        Russia

The df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()]) line searches for all misspelt country names as whole words in the Country column and replaces them with correct spelling variants.

You may also add whole word check when searching for countries if you use regexps in the countries list and then use re.search instead of if e in countries in the applyFunc.

Upvotes: 1

amanb
amanb

Reputation: 5473

Use:

In [15]: countries = ["italy", "russia", "yugoslavia", "italie"]

In [16]: for i in countries:df.loc[lambda x:x.Country.str.lower().str.contains(i), 'Country Clean'] = i.capitalize()

In [17]: df
Out[17]:
  Number                Country Country Clean
0      1                Italy 1         Italy
1      2                 Italie        Italie
2      1                   Ecco           NaN
3      2  Russia is in Euroasia        Russia
4      1             Yugoslavia    Yugoslavia
5      2                 Russia        Russia

Upvotes: 0

Related Questions