Reputation: 51
I started learning machine learning in Python using pandas and scikit-learn.
I tried to use the LinearRegression().fit method:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
house_data = pd.read_csv(r"C:\Users\yassine\Desktop\ml\OC-tp-ML\house_data.csv")
y = house_data[["price"]]
x = house_data[["surface","arrondissement"]]
X = house_data.iloc[:, 1:3].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
model = LinearRegression()
model.fit(x_train, y_train)
When I run the code, I get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me, please?
Upvotes: 4
Views: 1255
Reputation: 4275
Many machine learning models, including scikit-learn's LinearRegression, cannot be fit on data that contains NaN or inf, so you need to impute or remove those values as part of your data cleaning process. The predictions (yhat) of a linear regression are sensitive to the values you fill in, so I usually start with imputing the mean. If you aren't comfortable imputing the missing data, you can drop the observations that contain NaN, provided they make up only a small proportion of your data.
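To judge whether dropping is reasonable, it can help to measure how many values and rows are affected first. A minimal sketch, assuming a small made-up DataFrame (the values are invented; only the column names come from the question):

```python
import numpy as np
import pandas as pd

# Made-up data standing in for house_data.csv (values are invented)
df = pd.DataFrame({
    "price": [1200.0, 1500.0, np.nan, 900.0],
    "surface": [30.0, np.nan, 45.0, 25.0],
    "arrondissement": [1.0, 2.0, 3.0, 4.0],
})

# Number of NaN values in each column
nan_per_column = df.isna().sum()

# Fraction of rows that contain at least one NaN
nan_row_fraction = df.isna().any(axis=1).mean()

print(nan_per_column)
print(nan_row_fraction)
```

If the fraction is small, dropping those rows loses little data; otherwise imputing is probably the safer choice.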
Imputing the mean can look like this:
df = df.fillna(df.mean(numeric_only=True))
Imputing to zero can look like this:
df = df.fillna(0)
Imputing to a custom result can look like:
df = df.fillna(my_func(args))
Dropping altogether can look like:
df = df.dropna()
Converting inf to NaN ahead of time, so that these methods catch it as well, can look like:
df = df.replace([np.inf, -np.inf], np.nan)
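Put together with the code from the question, the cleaning steps could look roughly like this. It is only a sketch under some assumptions: the DataFrame is made up to stand in for house_data.csv, and dropping rows with a missing price (rather than imputing the target) is my own choice, not something the question requires.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data standing in for house_data.csv (values are invented)
house_data = pd.DataFrame({
    "price": [1200.0, 1500.0, 2100.0, 900.0, 1800.0, np.nan, 1300.0, 1700.0],
    "surface": [30.0, 40.0, np.inf, 25.0, 50.0, 35.0, np.nan, 45.0],
    "arrondissement": [1.0, 2.0, 3.0, 4.0, np.nan, 1.0, 2.0, 3.0],
})

# Turn +/-inf into NaN so the steps below catch them too
house_data = house_data.replace([np.inf, -np.inf], np.nan)

# Assumption: drop rows where the target itself is missing
house_data = house_data.dropna(subset=["price"])

# Fill the remaining NaN values (features only) with each column's mean
house_data = house_data.fillna(house_data.mean())

x = house_data[["surface", "arrondissement"]]
y = house_data["price"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

model = LinearRegression()
model.fit(x_train, y_train)  # the input no longer contains NaN or inf
predictions = model.predict(x_test)
```

Whether a mean is a sensible fill value for a column like arrondissement depends on what the column means in your data; dropping those rows may be the better option there.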
Upvotes: 4