Mayank Tripathi
Mayank Tripathi

Reputation: 523

Looking for smarter way to handle multiple version of a value in Python Pandas dataframe

I am working on Python data manipulation, and have a column with a variety of values that represent same value.

enter image description here

I did checked the uniques values and they are as array(['Male', 'male', 'm', 'FEMALE', 'female', 'F', 'M', 'Female'], dtype=object)

Please suggest how can i replace it to one version.

Below is the code I did try, but there are more such fields, so looking for smart way.

df['Gender'] = df['Gender'].replace(to_replace =["Male", "male", "m", "M"], value ="Male")
#similar code for Female value.

Another approach I am using is

df['Gender'] = np.where(df['Gender'].str.lower() == 'male', 'Male', df.Gender)
df['Gender'] = np.where(df['Gender'].str.lower() == 'm', 'Male', df.Gender)
df['Gender'] = np.where(df['Gender'].str.lower() == 'female', 'Female', df.Gender)
df['Gender'] = np.where(df['Gender'].str.lower() == 'f', 'Female', df.Gender)

Upvotes: 1

Views: 160

Answers (2)

RITESH DUBEY
RITESH DUBEY

Reputation: 75

You can try this with a nested numpy.where,

df['Gender']=np.where(df.Gender.str.lower().str.startswith('m'),'Male',np.where(df.Gender.str.lower().str.startswith('f'),'Female',df.Gender))

But in your case, the unique entries just comprise of different representations of "Male" and "Female" and no null so the very easy and quick solution for you can be something like this.

df['Gender']=np.where(df.Gender.str.lower().str.startswith('m'),'Male','Female')

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153460

You can use just 'm' and 'f':

pd.Series(['Male','male', 'm', 'Female', 'female', 'f']).str[0].str.lower()

Output:

0    m
1    m
2    m
3    f
4    f
5    f
dtype: object

Like this with your column:

df['Gender'] = df['Gender'].str[0].str.lower()

Upvotes: 1

Related Questions