Samar Pratap Singh

Reputation: 531

How to detect suspicious error in a column of a dataset?

I was trying to encode the data in the train.csv dataset provided in this GitHub repository. I used the following code to do so.

import pandas as pd 
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder() 
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df) 

While encoding, I got the following output:

MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'

But when I looked at the dataset, there wasn't any '<' in the Alley column. The previous columns were encoded without problems, but the Alley column causes this error. Please help me!

This is the Colab notebook with the code.

Upvotes: 1

Views: 241

Answers (1)

jezrael

Reputation: 863031

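The problem is that your missing values are never actually replaced: fillna returns a new object, so the result has to be assigned back, and this needs to be done for all columns, not just two. The NaN values that remain are floats, so LabelEncoder fails when it tries to sort them together with the strings in Alley, which is where the '<' in the message comes from. I also added .iloc[0] to mode to select the first value if there are 2 or more modes: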

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv', index_col='Id')
print(df)

# Split the columns into numeric and non-numeric (object) ones
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)

# Assign the filled values back; fillna does not modify df in place
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])

label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])

print(df)
      MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
Id                                                                          
1             60         3         65.0     8450       1      0         3   
2             20         3         80.0     9600       1      0         3   
3             60         3         68.0    11250       1      0         0   
4             70         3         60.0     9550       1      0         0   
5             60         3         84.0    14260       1      0         0   
         ...       ...          ...      ...     ...    ...       ...   
1456          60         3         62.0     7917       1      0         3   
1457          20         3         85.0    13175       1      0         3   
1458          70         3         66.0     9042       1      0         3   
1459          20         3         68.0     9717       1      0         3   
1460          20         3         75.0     9937       1      0         3   

      LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  \
Id                                       ...                            
1               3          0          4  ...         0       2      2   
2               3          0          2  ...         0       2      2   
3               3          0          4  ...         0       2      2   
4               3          0          0  ...         0       2      2   
5               3          0          2  ...         0       2      2   
          ...        ...        ...  ...       ...     ...    ...   
1456            3          0          4  ...         0       2      2   
1457            3          0          4  ...         0       2      2   
1458            3          0          4  ...         0       2      0   
1459            3          0          4  ...         0       2      2   
1460            3          0          4  ...         0       2      2   

      MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice  
Id                                                                              
1               2        0       2    2008         8              4     208500  
2               2        0       5    2007         8              4     181500  
3               2        0       9    2008         8              4     223500  
4               2        0       2    2006         8              0     140000  
5               2        0      12    2008         8              4     250000  
          ...      ...     ...     ...       ...            ...        ...  
1456            2        0       8    2007         8              4     175000  
1457            2        0       2    2010         8              4     210000  
1458            2     2500       5    2010         8              4     266500  
1459            2        0       4    2010         8              4     142125  
1460            2        0       6    2008         8              4     147500  

[1460 rows x 80 columns]
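
To see where the message comes from without loading train.csv, here is a minimal sketch on a toy frame. The values and the helper name mixed_type_columns are made up for illustration and are not part of the original data or of any library. It shows that fillna without assignment leaves the NaN in place, that sorting strings together with a float NaN raises the '<' TypeError, and one possible way to list the object columns that mix Python types before encoding.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real data).
df = pd.DataFrame({
    "MSZoning": pd.Series(["RL", "RM", "RL", "RL"], dtype=object),
    "Alley": pd.Series(["Grvl", np.nan, np.nan, "Pave"], dtype=object),
    "LotArea": [8450, 9600, 11250, 9550],
})

# fillna returns a new Series; without assignment df stays unchanged.
df["Alley"].fillna("Grvl")
still_missing = int(df["Alley"].isna().sum())

# Sorting strings together with a float NaN is what triggers the '<' error.
try:
    sorted(df["Alley"].unique())
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)


def mixed_type_columns(frame):
    """Return object-dtype columns that hold more than one Python type."""
    obj_cols = frame.select_dtypes(include="object").columns
    return [c for c in obj_cols if frame[c].map(type).nunique() > 1]


# Columns that would be risky to pass to an encoder as they are.
suspects = mixed_type_columns(df)
print(still_missing, sort_error, suspects)

# Assigning the filled column back removes the mixed types.
df["Alley"] = df["Alley"].fillna(df["Alley"].mode().iloc[0])
print(mixed_type_columns(df))
```

Running a check like this before the encoding loop points at the problem column directly, instead of waiting for the encoder to fail on it.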

Upvotes: 1
