If a column contains a word, then generate a new column with the value found

Question

I am trying to clean my df['Country'] variable by creating a new one df['Country Clean'] that takes the value of the country variable if it finds it in the df['Country'] column.

I figured out though that if I repeat my command, I will also delete my previous findings and i will only get a variable reporting the finding for 'Russia'

Is there a way to do this?

data = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy 1', 'Italie', 'Ecco', 'Russia is in Euroasia' ,  'Yugoslavia', 'Russia']}
df = pd.DataFrame(data) 
df['Country Clean'] = df['Country'].str.replace(r'(^.*Italy.*$)', 'Italy')
df['Country Clean']  = df['Country'].str.replace(r'(^.*Russia.*$)', 'Russia')

Expected output

data2 = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy', 'Italy', NaN, 'Russia' , NaN , 'Russia']}
exp = pd.DataFrame(data2) 
exp

Wiktor Stribiżew · Accepted Answer

I suggest first normalizing the country names and then change the Country Clean column values according to the allowed country list:

normalize_countries={"Italie": "Italy", "Rusia": "Russia"}    # Spelling corrections
pattern = r"\b(?:{})\b".format("|".join(normalize_countries)) # Regex to find misspellings

countries = ["Italy", "Russia"]                               # Country list
df['Country Clean'] = df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()])
def applyFunc(s):  
    for e in countries:
        if e in s:
            return e
    return 'NaN'

df['Country Clean'] = df['Country Clean'].apply(applyFunc)

Output:

>>> df
  Number                Country Country Clean
0      1                Italy 1         Italy
1      2                 Italie         Italy
2      1                   Ecco           NaN
3      2  Russia is in Euroasia        Russia
4      1             Yugoslavia           NaN
5      2                 Russia        Russia

The df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()]) line searches for all misspelt country names as whole words in the Country column and replaces them with correct spelling variants.

You may also add whole word check when searching for countries if you use regexps in the countries list and then use re.search instead of if e in countries in the applyFunc.

If a column contains a word, then generate a new column with the value found

Answers (2)

Related Questions