Reputation: 497
I am trying to map column values against a dictionary: if a field value matches a dictionary entry, all extra characters should be removed from it. I can match the values, but how can I remove the extra characters from the column?
Input Data
col_data
Indi8
United states / 08
UNITED Kindom (55)
ITALY 22
israel
Expected Output:
col_data
India
United States
United Kindom
Italy
Israel
Script I am using:
import numpy as np
from difflib import SequenceMatcher

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']
lower = [x.lower() for x in match_val]

def nearest(s):
    # Index of the dictionary entry most similar to s (case-insensitive)
    idx = np.argmax([SequenceMatcher(None, s.lower(), i).ratio() for i in lower])
    return match_val[idx]

df['col_data'] = df['col_data'].apply(nearest)
The above script matches the values against the list, but it does not remove the extra characters. How can I modify the script so that it also removes the extra characters after mapping?
Upvotes: 1
Views: 106
Reputation: 520878
I like this str.extract approach:
# expand=False returns a Series, so .str.title() can be chained
df['col_data'] = df['col_data'].str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)', expand=False).str.title()
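The regex ([A-Za-z]+(?: [A-Za-z]+)*) matches the first run of letter-only words separated by single spaces, so trailing content such as digits, slashes, and parentheses is dropped. Note that it only strips characters and does not correct spelling, so Indi8 becomes Indi rather than India.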
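If the misspelled or truncated values also need to be mapped back to the dictionary, one option is to strip the non-letter characters first and then run the fuzzy match from the question on the cleaned text. The following is only a sketch under that assumption: clean_and_match is a hypothetical helper name, and the sample list simply repeats the input data from the question.

```python
import re
from difflib import SequenceMatcher

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']

def clean_and_match(s):
    # Replace anything that is not a letter or space, then collapse whitespace.
    letters = ' '.join(re.sub(r'[^A-Za-z ]+', ' ', s).split())
    # Return the dictionary entry with the highest similarity ratio.
    return max(
        match_val,
        key=lambda m: SequenceMatcher(None, letters.lower(), m.lower()).ratio(),
    )

samples = ['Indi8', 'United states / 08', 'UNITED Kindom (55)', 'ITALY 22', 'israel']
result = [clean_and_match(s) for s in samples]
print(result)
```

On a DataFrame this could be applied with df['col_data'] = df['col_data'].apply(clean_and_match). Because the result always comes from match_val, the output uses the dictionary spelling (United Kingdom), and any input is mapped to its closest entry even when the similarity is low, so a minimum-ratio cutoff may be worth adding for real data.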
Upvotes: 1