Reputation: 523
I am working on Python data manipulation, and have a column with a variety of values that represent same value.
I did checked the uniques values and they are as array(['Male', 'male', 'm', 'FEMALE', 'female', 'F', 'M', 'Female'], dtype=object)
Please suggest how can i replace it to one version.
Below is the code I did try, but there are more such fields, so looking for smart way.
df['Gender'] = df['Gender'].replace(to_replace =["Male", "male", "m", "M"], value ="Male")
#similar code for Female value.
Another approach I am using is
df['Gender'] = np.where(df['Gender'].str.lower() == 'male', 'Male', df.Gender)
df['Gender'] = np.where(df['Gender'].str.lower() == 'm', 'Male', df.Gender)
df['Gender'] = np.where(df['Gender'].str.lower() == 'female', 'Female', df.Gender)
df['Gender'] = np.where(df['Gender'].str.lower() == 'f', 'Female', df.Gender)
Upvotes: 1
Views: 160
Reputation: 75
You can try this with a nested numpy.where
,
df['Gender']=np.where(df.Gender.str.lower().str.startswith('m'),'Male',np.where(df.Gender.str.lower().str.startswith('f'),'Female',df.Gender))
But in your case, the unique entries just comprise of different representations of "Male" and "Female" and no null so the very easy and quick solution for you can be something like this.
df['Gender']=np.where(df.Gender.str.lower().str.startswith('m'),'Male','Female')
Upvotes: 1
Reputation: 153460
You can use just 'm' and 'f':
pd.Series(['Male','male', 'm', 'Female', 'female', 'f']).str[0].str.lower()
Output:
0 m
1 m
2 m
3 f
4 f
5 f
dtype: object
Like this with your column:
df['Gender'] = df['Gender'].str[0].str.lower()
Upvotes: 1