Reputation: 1198
I am performing some tests on the house prices prediction competition on Kaggle.
For convenience, here is the complete process to download, pre-process and start predicting with a simple linear regression model:
import os
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
saveDir = "data"
if not os.path.exists(saveDir):
    os.makedirs(saveDir)
api.competition_download_files("house-prices-advanced-regression-techniques", saveDir)
# the API saves a single zip archive in saveDir; extract it before reading the CSVs
import zipfile
with zipfile.ZipFile(os.path.join(saveDir, "house-prices-advanced-regression-techniques.zip")) as z:
    z.extractall(saveDir)
print("the following files have been downloaded \n" + '\n'.join('{}'.format(item) for item in os.listdir("data")))
print("they are located in " + saveDir)
train = pd.read_csv(os.path.join(saveDir, "train.csv"))
test = pd.read_csv(os.path.join(saveDir, "test.csv"))
xTrain = train.iloc[:,1:-1] # remove id & SalePrice
yTrain = train.iloc[:,-1] # SalePrice
xTest = test.iloc[:,1:] # remove id
catData = xTrain.columns[xTrain.dtypes == object]
numData = list(set(xTrain.columns) - set(catData))
print("The number of columns in the original dataframe is " + str(len(xTrain.columns)))
print("The number of columns in the categorical and numerical data dds up to " + str(len(catData)+len(numData)))
def cleanData(data, catData, numData):
    dataClean = data.copy()
    # Let's deal with NaN ...
    # check where there are NaN among the categorical features
    print(dataClean[catData].columns[dataClean[catData].isna().any(axis=0)])
    # take care that some categorical columns could actually hold numerics,
    # so differentiate the two cases; be careful not to evaluate the type
    # of a value that is NaN or None
    dataTypes = [dataClean.loc[dataClean.loc[:, col].notnull(), col].apply(type).iloc[0]
                 for col in catData]  # data type of the first non-null value in each column
    from itertools import compress
    catDataNum = [dtype in (float, int) for dtype in dataTypes]  # True if the column holds numerics
    catDataNum = list(compress(catData, catDataNum))
    catDataNotNum = list(set(catData) - set(catDataNum))
    print("The number of columns in the dataframe is " + str(len(dataClean.columns)))
    print("The number of columns in the categorical and numerical data adds up to " +
          str(len(catDataNum) + len(catDataNotNum) + len(numData)))
    # Check what NA means for each feature ...
    # BsmtQual : NA means no basement
    # GarageType : NA means no garage
    # BsmtExposure : NA means no basement
    # Alley : NA means no alley access
    # BsmtFinType2 : NA means no basement
    # GarageFinish : NA means no garage
    # did not check the rest ... I will just replace with a category "No"
    # For categorical features, a NaN value means the considered feature
    # does not exist (this requires the dataset analysis performed above)
    dataClean[catDataNotNum] = dataClean[catDataNotNum].fillna(value='No')
    # for the numeric-valued categorical columns, replace NaN with the mean
    # (cast to float so mean() works on these object-dtype columns)
    mean = dataClean[catDataNum].astype(float).mean()
    dataClean[catDataNum] = dataClean[catDataNum].fillna(value=mean)
    # for numerical features, replace NaN with the column mean
    mean = dataClean[numData].mean()
    dataClean[numData] = dataClean[numData].fillna(value=mean)
    return dataClean
xTrainClean = cleanData(xTrain, catData, numData)
# check that no NaN or None remain
if xTrainClean.isna().sum().sum() != 0:
    print(xTrainClean.iloc[:, xTrainClean.isna().any(axis=0).values])
else:
    print("All good! No more NaN or None in the training data!")
# same with test data
# perform the cleaning
xTestClean = cleanData(xTest, catData, numData)
# check that no NaN or None remain
if xTestClean.isna().sum().sum() != 0:
    print(xTestClean.iloc[:, xTestClean.isna().any(axis=0).values])
else:
    print("All good! No more NaN or None in the test data!")
import sklearn as sk
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# We would like to perform a linear regression on all data
# but some data are categorical ...
# so first, perform a one-hot encoding on categorical variables
ct = ColumnTransformer(transformers=[("OneHotEncoder",
                                      OneHotEncoder(categories='auto', drop=None,
                                                    sparse=False,  # sparse_output=False in scikit-learn >= 1.2
                                                    handle_unknown="error"),
                                      catData)],
                       remainder="passthrough")
ct = ct.fit(pd.concat([xTrainClean, xTestClean])) # fit on both xTrain & xTest to be sure to have all possible categorical values
# test it separately (.fit(xTrain) / .fit(xTest)) and analyze the results to understand
# the resulting categories and values can be obtained through
# ct.named_transformers_['OneHotEncoder'].categories_
xTrainOneHot = ct.transform(xTrainClean)
xTestOneHot = ct.transform(xTestClean)
xTestOneHotKaggle = xTestOneHot.copy()
from sklearn.model_selection import train_test_split
xTrainInternalOneHot, xTestInternalOneHot, yTrainInternal, yTestInternal = train_test_split(xTrainOneHot, yTrain, test_size=0.5, random_state=42, shuffle = False)
print("The training data now contains " + str(xTrainInternalOneHot.shape[0]) + " samples")
print("The training data now contains " + str(yTrainInternal.shape[0]) + " labels")
print("The test data now contains " + str(xTestInternalOneHot.shape[0]) + " samples")
print("The test data now contains " + str(yTestInternal.shape[0]) + " labels")
reg = LinearRegression().fit(xTrainInternalOneHot,yTrainInternal)
yTrainInternalPredict = reg.predict(xTrainInternalOneHot)
yTestInternalPredict = reg.predict(xTestInternalOneHot)
print("The R2 score on training data is equal to " + str(reg.score(xTrainInternalOneHot,yTrainInternal)))
print("The R2 score on the internal test data is equal to " + str(reg.score(xTestInternalOneHot, yTestInternal)))
from sklearn.metrics import mean_squared_log_error
print("Tke Kaggle metric score (RMSLE) on internal training data is equal to " +
str(np.sqrt(mean_squared_log_error(yTrainInternal, yTrainInternalPredict))))
print("Tke Kaggle metric score (RMSLE) on internal test data is equal to " +
str(np.sqrt(mean_squared_log_error(yTestInternal, yTestInternalPredict))))
So with the above process, one gets an error when computing the Kaggle metric, i.e. the RMSLE, because some predicted values are negative. The funny thing is that if I change the test_size parameter from 0.5 to 0.2, there are no more negative values. One could understand this as: more data is used for training, so the model performs better. But if I move it from 0.2 to 0.3 (a less dramatic change, i.e. ~100 training samples), the issue of the model predicting negative values appears again.
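For completeness, the error itself can be sidestepped by clipping the predictions at zero before scoring (a minimal sketch reusing the variables above; it only hides the problem, hence the questions below):
import numpy as np
from sklearn.metrics import mean_squared_log_error
# mean_squared_log_error raises a ValueError as soon as any value is negative,
# so clip the predictions at zero before computing the RMSLE (a workaround only)
clipped = np.clip(yTestInternalPredict, 0, None)
print("RMSLE with clipped predictions: " +
      str(np.sqrt(mean_squared_log_error(yTestInternal, clipped))))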
Two questions:
Is this expected, i.e. is the model really so sensitive to the training data? This is even clearer because if test_size = 0.2 is used with shuffle = False, then it works; if shuffle = True is used, then the model starts predicting negative values.
How does one deal with such behavior? Obviously this is a very simple model (no standardization, no scaling, no regularization, ...), but I believe it is interesting to really understand what is going on in this very simple model.
Upvotes: 0
Views: 434
Reputation: 6270
Is this expected, i.e. is the model really so sensitive to the training data? This is even clearer because if test_size = 0.2 is used with shuffle = False, then it works; if shuffle = True is used, then the model starts predicting negative values.
For your first question: yes, the split can matter!
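You can see this directly with a quick sketch (assuming xTrainOneHot and yTrain from your question) that refits the regression on a few random splits and counts the negative predictions:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# refit on several random splits and count how many test predictions go negative
for seed in range(5):
    xTr, xTe, yTr, yTe = train_test_split(xTrainOneHot, yTrain, test_size=0.3,
                                          random_state=seed, shuffle=True)
    nNeg = (LinearRegression().fit(xTr, yTr).predict(xTe) < 0).sum()
    print("seed " + str(seed) + ": " + str(nNeg) + " negative predictions")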
How does one deal with such behavior? Obviously this is a very simple model (no standardization, no scaling, no regularization, ...), but I believe it is interesting to really understand what is going on in this very simple model.
Have you ever heard of cross-validation?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
The concept is to train your classifier/regressor on several data splits, each with a different train/test split, to avoid the behavior you are explaining; then you can really judge your prediction quality, since new data could also have several different structures. So you run several iterations and then judge the outcome.
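A minimal sketch with scikit-learn (again assuming xTrainOneHot and yTrain from your question):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
# 5-fold cross-validation: every sample is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), xTrainOneHot, yTrain, cv=cv, scoring="r2")
print("R2 per fold:", scores)
print("mean R2: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))
The spread of the per-fold scores tells you how sensitive the model really is to the split, which is exactly the behavior you observed with different test_size values.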
Upvotes: 1