Sahil

Reputation: 439

Using LabelEncoder for a series in scikit-learn

I have a column in a dataset that holds categorical values, and I want to convert them to numerical values. I am trying to use LabelEncoder, but I get an error when I do so.

from sklearn.preprocessing import LabelEncoder
m = hsp_train["Alley"]
m_enc = LabelEncoder()
j = m_enc.fit_transform(m)

I am getting an error:

unorderable types: float() > str()

The column has 3 distinct values. I want them encoded as 0, 1, 2 respectively, but I get the error above instead.

I also tried this:

l = hsp_train["Alley"]
l_enc = pd.factorize(l)
hsp_train["Alley"] = l_enc[0]

But this gives me the values -1, 1, 2, which I don't want. I want the values to start from 1.

Upvotes: 4

Views: 2549

Answers (1)

Nickil Maveli

Reputation: 29711

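The error indicates that your series contains missing values: NaN is a float, and LabelEncoder cannot sort floats together with strings. If you want to remove the NaN values from your series, use hsp_train["Alley"].dropna(). Note that this returns a new series and leaves the original unchanged.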
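As a minimal sketch of the dropna route, the snippet below uses a small made-up series in place of hsp_train["Alley"] (the labels 'Grvl' and 'Pave' are assumptions for illustration only). It keeps the surviving index so the codes can be aligned back to the original rows if needed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for hsp_train["Alley"]; the labels are made up.
alley = pd.Series(['Grvl', 'Pave', np.nan, 'Grvl', np.nan, 'Pave'])

# dropna() returns a new series without the NaN rows.
alley_clean = alley.dropna()

enc = LabelEncoder()
codes = enc.fit_transform(alley_clean)

# Reattach the surviving index so the codes line up with the original rows.
encoded = pd.Series(codes, index=alley_clean.index)

print(list(enc.classes_))
print(encoded)
```

The encoded series is shorter than the original, so this only fits cases where dropping the rows with missing values is acceptable.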

Illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Categorical': ['apple', 'mango', 'apple', 
                                   'orange', 'mango', 'apple', 
                                   'orange', np.nan]})

Using LabelEncoder to encode the categorical labels:

enc = LabelEncoder()
enc.fit_transform(df['Categorical'])

Gives:

TypeError: unorderable types: float() > str()

pd.factorize assigns -1 to missing values by default, which is why you see that value in your output:

pd.factorize(df['Categorical'])[0]
array([ 0,  1,  0,  2,  1,  0,  2, -1])
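If the goal is simply to avoid the -1 and have real categories start from 1, one possible workaround (a sketch, not the only option) is to shift the factorized codes by one, so that missing values become 0. The variable names here are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Categorical': ['apple', 'mango', 'apple',
                                   'orange', 'mango', 'apple',
                                   'orange', np.nan]})

# factorize returns the integer codes and the unique labels;
# codes follow order of first appearance, and -1 marks a missing value.
codes, uniques = pd.factorize(df['Categorical'])

# Shifting by one maps missing values to 0 and the categories to 1, 2, 3.
shifted = codes + 1

print(codes)
print(list(uniques))
print(shifted)
```

With this shift, 0 acts as the "missing" code, so it should be treated as such in any later processing.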

If you do not want pandas to detect NaN values at all, and would rather treat such entries as ordinary strings, you can disable the detection while reading the file with na_filter:

df = pd.read_csv(data, na_filter=False, ...)

When the data contains no missing values, this can also noticeably speed up reading a large file.
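To see the effect of na_filter, here is a small self-contained sketch that reads an in-memory CSV instead of a real file. The column names and the literal "NA" entries are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; "NA" is one of the strings pandas treats as missing by default.
csv_text = "Id,Alley\n1,Grvl\n2,NA\n3,Pave\n4,NA\n"

# Default behaviour: "NA" is parsed as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# With na_filter=False the text is kept as the plain string "NA".
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default['Alley'].isna().sum())
print(raw['Alley'].tolist())
```

Because the second frame holds only strings in that column, it can be passed to LabelEncoder without the type error. Keep in mind that genuinely empty fields are then read as empty strings rather than NaN.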


Alternatively, you could use fillna to replace all the NaN values with a string of your choice:

df.fillna('Na', inplace=True)

This replaces every NaN value with the string "Na", after which you can encode the column as before.
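Putting the fillna route together with LabelEncoder, a minimal end-to-end sketch might look like this. It fills only the one column rather than the whole frame, and the 'Encoded' column name is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Categorical': ['apple', 'mango', 'apple',
                                   'orange', 'mango', 'apple',
                                   'orange', np.nan]})

# Replace NaN with a placeholder string so the column holds only strings.
df['Categorical'] = df['Categorical'].fillna('Na')

enc = LabelEncoder()
# 'Encoded' is a hypothetical column name for the integer labels.
df['Encoded'] = enc.fit_transform(df['Categorical'])

print(list(enc.classes_))
print(df)
```

LabelEncoder sorts the labels before numbering them, so the placeholder receives its own integer code like any other category, and enc.inverse_transform can map the codes back to the strings.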

Upvotes: 5
