tbone

Reputation: 1322

Handling missing values with linear regression

I am trying to handle missing values in one of the columns with linear regression.

The name of the column is "Landsize" and I am trying to predict NaN values with linear regression using several other variables.

Here is the linear regression code:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Importing the dataset
dataset = pd.read_csv('real_estate.csv')

linreg = LinearRegression()
data = dataset[['Price', 'Rooms', 'Distance', 'Landsize']]
# Step 1: Rows with a known Landsize form the training set; rows with a missing Landsize form the test set.
x_train = data[data['Landsize'].notnull()].drop(columns='Landsize')
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']
# Step 2: Train the machine learning algorithm.
linreg.fit(x_train, y_train)
# Step 3: Predict the missing values in the attribute of the test data.
predicted = linreg.predict(x_test)
# Step 4: Obtain the complete dataset by writing the predictions into the missing rows.
# .loc avoids the chained assignment of dataset.Landsize[...] = predicted.
dataset.loc[dataset['Landsize'].isnull(), 'Landsize'] = predicted
dataset.info()

When I try to check the regression result I get this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The accuracy check that raises it:

accuracy = linreg.score(x_test, y_test)
print(accuracy*100,'%')

Upvotes: 1

Views: 3873

Answers (1)

Souha Gaaloul

Reputation: 328

I think the problem is that you are passing NaN values to the algorithm. Dealing with NaN values is one of the first steps of data preprocessing. So perhaps you could convert your NaN values to 0 and predict for the rows where Landsize = 0 (which is logically the same as a NaN value, because a land size can't be 0).
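As a rough way to see where the NaN values come from, a sketch like the following counts them per column and checks the target of the rows being imputed. The DataFrame is a small made-up stand-in for real_estate.csv (the values are not from the question), so treat it as an illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for real_estate.csv; the values are made up.
data = pd.DataFrame({
    "Price": [500.0, 650.0, 720.0, 810.0, 900.0, 560.0],
    "Rooms": [2, 3, 3, 4, 4, 2],
    "Distance": [5.0, 7.5, 3.2, 10.1, 8.4, 6.0],
    "Landsize": [200.0, np.nan, 310.0, 450.0, np.nan, 220.0],
})

# Same selection as in the question: the rows whose Landsize is missing.
missing = data["Landsize"].isna()
y_test = data.loc[missing, "Landsize"]

# NaN count per column: a NaN in any feature column would also break fit/predict.
nan_counts = data.isna().sum()
print(nan_counts)

# The "true" values of the rows to impute are all NaN,
# so passing them to a scoring function means passing NaN.
all_targets_missing = bool(y_test.isna().all())
print(all_targets_missing)
```

If the feature columns show no NaN and only the target of the selected rows does, the scoring call is the likely place where NaN reaches scikit-learn.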

Another thing I think you're doing wrong is:

x_train = data[data['Landsize'].notnull()].drop(columns='Landsize') 
y_train = data[data['Landsize'].notnull()]['Landsize']
x_test = data[data['Landsize'].isnull()].drop(columns='Landsize')
y_test = data[data['Landsize'].isnull()]['Landsize']

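Your test set contains only the rows where Landsize is missing, so y_test is all NaN and there is nothing to score the predictions against. You should maybe split the rows with a known Landsize instead: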

from sklearn.model_selection import train_test_split

X = data[data['Landsize'].notnull()].drop(columns='Landsize')
y = data[data['Landsize'].notnull()]['Landsize']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
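Putting the pieces together, here is a minimal sketch of the whole workflow: evaluate on a held-out part of the rows with a known Landsize, then fill in the missing rows. The data is synthetic (generated with NumPy, not taken from real_estate.csv), and names such as `features`, `known` and `X_val` are my own choices for illustration. Note that `LinearRegression.score` returns the R² coefficient, not a classification accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real_estate.csv (assumption: all features are numeric and complete).
rng = np.random.default_rng(0)
n = 60
rooms = rng.integers(1, 6, size=n)
distance = rng.uniform(1, 20, size=n)
price = 100 * rooms + rng.normal(0, 10, size=n)
landsize = 50 * rooms + 5 * distance + rng.normal(0, 5, size=n)
dataset = pd.DataFrame(
    {"Price": price, "Rooms": rooms, "Distance": distance, "Landsize": landsize}
)
dataset.loc[dataset.index[::6], "Landsize"] = np.nan  # knock out every 6th value

features = ["Price", "Rooms", "Distance"]
known = dataset["Landsize"].notna()

# Evaluate only on rows where the true Landsize is available.
X = dataset.loc[known, features]
y = dataset.loc[known, "Landsize"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42
)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
r2 = linreg.score(X_val, y_val)  # R^2 on the held-out known rows
print(f"R^2 on held-out known rows: {r2:.3f}")

# Impute the missing rows with .loc (no chained assignment).
dataset.loc[~known, "Landsize"] = linreg.predict(dataset.loc[~known, features])
n_missing_after = int(dataset["Landsize"].isna().sum())
print(n_missing_after)
```

Once you are happy with the validation score, you could refit on all rows with a known Landsize before imputing; that is a design choice, not something the question requires.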

Upvotes: 2
