Salma

Reputation: 119

Scikit-learn: error in replacing missing data

I am trying to preprocess my data by replacing the missing values with the mean.

My code is as follows:

#Load the Data 
import numpy as np
data_2 = np.genfromtxt('data.csv', delimiter=',', skip_header=1)

#the missing values in my dataset are identified by value = 0 
#I'm trying to replace the missing values in the third column 
from sklearn.preprocessing import Imputer 
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(data_2[:, 2])

It runs but gives these warnings:

/Users/user1/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

/Users/user1/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

My main problem, however, is that it did not fill in the missing data. I printed the data before and after fitting and there was no change.

What am I doing wrong?

Update: Here are a few lines of my dataset:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0

Upvotes: 0

Views: 908

Answers (1)

Aakash Makwana

Reputation: 754

  • The first few lines you shared don't contain any null (empty) values, which makes the problem hard to illustrate.
  • Consider this slightly modified version of your dataset, in which some values are left empty:

    6,148,72,35,0,33.6,0.627,50,1
    1,85,,29,0,26.6,0.351,,
    ,183,64,,0,,0.672,32,1
    1,89,66,23,94,28.1,0.167,21,0
    
  • An easy way to fill missing values is to use the pandas library:

    #Load Libraries and data
    import pandas as pd
    df = pd.read_csv('data.csv',names=[1,2,3,4,5,6,7,8,9])
    
    #Fill the Null values with the mean
    df = df.fillna(df.mean())
    
  • The names argument of read_csv assigns names to the columns of the CSV file.

  • fillna() fills the missing values; passing df.mean() fills each column with that column's own mean.
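
The question marks missing values with 0 instead of leaving them empty, so fillna() alone would not touch them. One possible adaptation is to convert those zeros to NaN first and then fill with the column mean. The sketch below is only an illustration: the inline CSV text stands in for data.csv, and the choice of column 4 (the variable col) is an assumption made because that column contains a 0 in the sample rows. As a side note, the fit call in the question only computes the mean and does not modify the array; the filled values would come from transform or fit_transform.

```python
import io

import numpy as np
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
"""

# No header row in this snippet, so name the columns 1..9 as in the answer.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=list(range(1, 10)))

# Assumption: 0 means "missing" in this column only (hypothetical choice).
col = 4

# Turn the 0 placeholders into NaN so they are excluded from the mean.
df[col] = df[col].replace(0, np.nan)

# Fill the NaN entries with the mean of the remaining values in the column.
df[col] = df[col].fillna(df[col].mean())

print(df[col].tolist())
```

Restricting the replacement to one column avoids overwriting zeros that are legitimate values elsewhere, such as the 0/1 label in the last column.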

Upvotes: 1
