Reputation: 531
I was trying to encode the data in the dataset named train.csv
provided in this GitHub repository. I used the following code to do so.
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df)
While encoding, I got the following output.
MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'
But when I looked at the dataset, there wasn't any '<' in the Alley column.
The previous columns were encoded, but the Alley column causes an error. Please help me!
This is the Colab notebook with the code
Upvotes: 1
Views: 241
Reputation: 863031
The problem is that the missing values are not replaced in all columns, and the result of fillna needs to be assigned back. I also added .iloc[0] to mode to select the first value if there are 2 or more modes:
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv', index_col='Id')
print(df)

# Split the columns into numeric and non-numeric (object) ones
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)

# fillna returns a new object, so the result is assigned back
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])

label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])
print(df)
      MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
Id
1             60         3         65.0     8450       1      0         3
2             20         3         80.0     9600       1      0         3
3             60         3         68.0    11250       1      0         0
4             70         3         60.0     9550       1      0         0
5             60         3         84.0    14260       1      0         0
...          ...       ...          ...      ...     ...    ...       ...
1456          60         3         62.0     7917       1      0         3
1457          20         3         85.0    13175       1      0         3
1458          70         3         66.0     9042       1      0         3
1459          20         3         68.0     9717       1      0         3
1460          20         3         75.0     9937       1      0         3

      LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  \
Id                                       ...
1               3          0          4  ...         0       2      2
2               3          0          2  ...         0       2      2
3               3          0          4  ...         0       2      2
4               3          0          0  ...         0       2      2
5               3          0          2  ...         0       2      2
...           ...        ...        ...  ...       ...     ...    ...
1456            3          0          4  ...         0       2      2
1457            3          0          4  ...         0       2      2
1458            3          0          4  ...         0       2      0
1459            3          0          4  ...         0       2      2
1460            3          0          4  ...         0       2      2

      MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
Id
1               2        0       2    2008         8              4     208500
2               2        0       5    2007         8              4     181500
3               2        0       9    2008         8              4     223500
4               2        0       2    2006         8              0     140000
5               2        0      12    2008         8              4     250000
...           ...      ...     ...     ...       ...            ...        ...
1456            2        0       8    2007         8              4     175000
1457            2        0       2    2010         8              4     210000
1458            2     2500       5    2010         8              4     266500
1459            2        0       4    2010         8              4     142125
1460            2        0       6    2008         8              4     147500

[1460 rows x 80 columns]
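To see where the '<' in the error comes from and why assigning back matters, here is a minimal sketch on a made-up two-column frame. The `toy` name and its values are only illustrative and are not taken from train.csv. The assumption is that the encoder has to sort the unique values of a column, and Python cannot order a string against NaN, which is a float, so the '<' is the comparison operator and not a character in the data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv (values are illustrative only).
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan, "Pave"],
    "Fence": ["MnPrv", "GdWo", np.nan, "MnPrv", "GdWo"],
})

# Sorting a mix of str and NaN (a float) fails with a TypeError about '<'.
try:
    sorted(toy["Alley"].tolist())
    sort_failed = False
except TypeError:
    sort_failed = True

# fillna returns a new Series; without assignment the column keeps its NaN values.
toy["Alley"].fillna("Grvl")
alley_missing_without_assignment = int(toy["Alley"].isna().sum())

# mode() returns a DataFrame with one row per tied value (Fence has a tie here),
# so .iloc[0] selects a single fill value per column.
modes = toy.mode()
toy = toy.fillna(modes.iloc[0])
missing_after_assignment = int(toy.isna().sum().sum())
```

In this sketch the un-assigned `fillna` leaves both NaN values in Alley, `modes` has two rows because of the tie in Fence, and after assigning the result back no missing values remain, so the columns contain only strings and can be sorted.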
Upvotes: 1