NKJ

Reputation: 497

How to remove extra characters from a column value using Python

I am trying to map column values to a list of reference values: where a field value matches an entry, all the extra characters should be removed from it. I can match the values, but how can I remove the extra characters from the column?

Input Data

col_data

Indi8
United states / 08
UNITED Kindom (55)
ITALY 22
israel

Expected Output:

col_data

India
United States
United Kindom
Italy
Israel

Script I am using:

import numpy as np
from difflib import SequenceMatcher

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']

lower = [x.lower() for x in match_val]

def nearest(s):
    # Index of the reference value most similar to s (case-insensitive)
    idx = np.argmax([SequenceMatcher(None, s.lower(), i).ratio() for i in lower])
    return match_val[idx]

df['col_data'] = df['col_data'].apply(nearest)

The above script matches the values against the list, but it does not remove the extra characters. How can I modify the script so that it also removes the extra characters after mapping?

Upvotes: 1

Views: 106

Answers (1)

Tim Biegeleisen

Reputation: 520878

I like this str.extract approach:

df['col_data'] = df['col_data'].str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)', expand=False).str.title()

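The regex ([A-Za-z]+(?: [A-Za-z]+)*) matches the first run of letter-only words separated by single spaces, so trailing content such as digits, slashes, and parentheses is dropped. Passing expand=False makes str.extract return a Series, which is what .str.title() needs. Note that this only strips characters and does not correct misspellings: Indi8 becomes Indi, not India.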
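If the misspelled values also need to end up as entries from the list, one possible sketch is to strip the non-letter characters first and then apply the fuzzy match from the question. The helper name clean and the sample DataFrame below are assumptions made for illustration; match_val and nearest follow the question's script, with max used in place of np.argmax.

```python
import re
from difflib import SequenceMatcher

import pandas as pd

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']

def clean(s):
    # Hypothetical helper: replace anything that is not a letter or space,
    # then collapse repeated whitespace.
    letters = re.sub(r'[^A-Za-z ]+', ' ', s)
    return ' '.join(letters.split())

def nearest(s):
    # Return the reference value with the highest similarity ratio
    # to the cleaned, lower-cased input.
    cleaned = clean(s).lower()
    return max(match_val,
               key=lambda m: SequenceMatcher(None, cleaned, m.lower()).ratio())

# Sample data copied from the question's input (assumed to be one string column).
df = pd.DataFrame({'col_data': ['Indi8', 'United states / 08',
                                'UNITED Kindom (55)', 'ITALY 22', 'israel']})
df['col_data'] = df['col_data'].apply(nearest)
print(df)
```

Because nearest always returns an entry of match_val, the result contains only the reference spellings. A value that resembles none of them would still be mapped to its closest entry, so a minimum-ratio cutoff may be worth adding for real data.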

Upvotes: 1
