Reputation: 439
I have a column in a dataset that holds categorical values, and I want to convert them to numerical values. I am trying to use LabelEncoder but get an error when doing so.
from sklearn.preprocessing import LabelEncoder
m = hsp_train["Alley"]
m_enc = LabelEncoder()
j = m_enc.fit_transform(m)
I am getting an error:
unorderable types: float() > str()
The series in that column has 3 distinct values. I want them to be encoded as 0, 1, 2 respectively, but I get the error above instead.
I also tried this:
l = hsp_train["Alley"]
l_enc = pd.factorize(l)
hsp_train["Alley"] = l_enc[0]
But this gives me the values -1, 1, 2, which I don't want; I want the codes to start from 1.
Upvotes: 4
Views: 2549
Reputation: 29711
You clearly have missing values in your series. NaN is a float, so LabelEncoder fails when it tries to sort it together with the strings. If you want to remove the NaN values from your series, just do hsp_train["Alley"].dropna()
Illustration:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Categorical': ['apple', 'mango', 'apple',
                                   'orange', 'mango', 'apple',
                                   'orange', np.nan]})
Using LabelEncoder to encode the categorical labels:
enc = LabelEncoder()
enc.fit_transform(df['Categorical'])
Gives:
TypeError: unorderable types: float() > str()
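For comparison, here is a minimal sketch of the dropna route on the same illustrative data. The variable names `present` and `codes` are made up for this example, and keeping the original index on the result is just one possible choice so the codes can be aligned back to the frame later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Categorical': ['apple', 'mango', 'apple',
                                   'orange', 'mango', 'apple',
                                   'orange', np.nan]})

# Drop the missing entries; the surviving rows keep their index labels.
present = df['Categorical'].dropna()

# With only strings left, LabelEncoder can sort the labels without error.
enc = LabelEncoder()
codes = pd.Series(enc.fit_transform(present), index=present.index)

print(list(enc.classes_))  # ['apple', 'mango', 'orange']
print(codes.tolist())      # [0, 1, 0, 2, 1, 0, 2]
```

Note that the row that held NaN has no code at all in `codes`, so this only fits if discarding those rows is acceptable.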
pd.factorize automatically assigns -1 to missing values by default, which is why you get those values:
pd.factorize(df['Categorical'])[0]
array([ 0, 1, 0, 2, 1, 0, 2, -1])
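If the goal is simply to avoid the negative code, one possible workaround (an assumption about what is wanted, not something pd.factorize does for you) is to shift the codes by one, so that missing values become 0 and the real categories start from 1. A small sketch on a shortened, made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['apple', 'mango', 'apple', 'orange', np.nan])

# factorize returns (codes, uniques); codes follow order of first appearance
# and missing values get the sentinel -1.
codes, uniques = pd.factorize(s)
print(codes.tolist())    # [0, 1, 0, 2, -1]
print(list(uniques))     # ['apple', 'mango', 'orange']

# Shift by one: NaN -> 0, categories -> 1, 2, 3.
shifted = codes + 1
print(shifted.tolist())  # [1, 2, 1, 3, 0]
```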
If you do not want NaN values to be detected, and want such entries treated like any other string, you can turn detection off while reading the file using na_filter:
df = pd.read_csv(data, na_filter=False, ...)
For data without any real missing values, this can also improve the performance of reading a large file.
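To see the effect of na_filter, here is a small sketch that reads the same text twice. The CSV content and the `csv_text` name are invented for illustration; an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical file content: "NA" is a legitimate category here, not a gap.
csv_text = "Id,Alley\n1,Grvl\n2,NA\n3,Pave\n4,NA\n"

# Default behaviour: the string "NA" is parsed as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['Alley'].isna().sum())  # 2

# With na_filter=False the entries are kept as the literal string "NA".
raw_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(raw_df['Alley'].tolist())          # ['Grvl', 'NA', 'Pave', 'NA']
```

Keep in mind that na_filter=False applies to every column, so genuinely empty cells are read as empty strings rather than NaN.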
Or, you could use fillna to replace all the NaN values with a string of your choice:
df.fillna('Na', inplace=True)
This replaces all the NaN values with the string "Na", and you can continue as before.
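Putting the fillna approach together with LabelEncoder on the illustrative frame gives something like the sketch below. It assigns the filled column back instead of using inplace=True, which is only a stylistic choice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Categorical': ['apple', 'mango', 'apple',
                                   'orange', 'mango', 'apple',
                                   'orange', np.nan]})

# Replace NaN with a placeholder string so every value is a str.
df['Categorical'] = df['Categorical'].fillna('Na')

enc = LabelEncoder()
df['Encoded'] = enc.fit_transform(df['Categorical'])

# Classes are sorted; uppercase 'N' sorts before the lowercase labels,
# so the placeholder ends up with code 0.
print(list(enc.classes_))      # ['Na', 'apple', 'mango', 'orange']
print(df['Encoded'].tolist())  # [1, 2, 1, 3, 2, 1, 3, 0]
```

Here the placeholder becomes a category of its own, so every row gets a non-negative code.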
Upvotes: 5