Reputation: 97
I'm new to Data Science and Analysis. After going through a lot of kernels on Kaggle, I made a model that predicts the price of a property. I've tested this model using my training data, but now I want to run it on my test data. I've got a test.csv file and I want to use it. How do I do that? Here is what I previously did with my training dataset:
#loading my train dataset into python
train = pd.read_csv('/Users/sohaib/Downloads/test.csv')
#factors that will predict the price
train_pr = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']
#set my model to DecisionTree
model = DecisionTreeRegressor()
#set prediction data to factors that will predict, and set target to SalePrice
prdata = train[train_pr]
target = train.SalePrice
#fitting model with prediction data and telling it my target
model.fit(prdata, target)
model.predict(prdata.head())
What I tried next was to copy the whole code and replace "train" with "test" and "prdata" with "testprdata". I thought that would work, but sadly it didn't. I know I'm doing something wrong here, but I don't know what it is.
Upvotes: 6
Views: 34029
Reputation: 2897
As long as you process the train and test data exactly the same way, that predict function will work on either data set. So you'll want to load both the train and test sets, fit on the train, and predict on either just the test or both the train and test.
Also, note the file you're reading is the test data. Assuming your file is named properly, even though you named the variable train, you are currently training on your test data.
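#imports needed by the code below
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
#loading my train and test datasets into python
train = pd.read_csv('/Users/sohaib/Downloads/train.csv')
test = pd.read_csv('/Users/sohaib/Downloads/test.csv')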
#factors that will predict the price
desired_factors = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']
#set my model to DecisionTree
model = DecisionTreeRegressor()
#set prediction data to factors that will predict, and set target to SalePrice
train_data = train[desired_factors]
test_data = test[desired_factors]
target = train.SalePrice
#fitting model with prediction data and telling it my target
model.fit(train_data, target)
model.predict(test_data.head())
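To predict for every row of the test set rather than only the first five, drop .head(). Below is a minimal, self-contained sketch of that flow. The two small DataFrames are made-up stand-ins for train.csv and test.csv, only two of the factors are used to keep it short, and the median fill is an assumption about how you might handle gaps in the test file, not something the code above does:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-ins for train.csv and test.csv (illustrative values only).
train = pd.DataFrame({
    "OverallQual": [5, 7, 8, 6],
    "GrLivArea": [1200, 1800, 2400, 1500],
    "SalePrice": [120000, 200000, 310000, 160000],
})
test = pd.DataFrame({
    "Id": [1461, 1462],
    "OverallQual": [7, 5],
    "GrLivArea": [1800, None],  # a missing value, as real test files may have
})

desired_factors = ["OverallQual", "GrLivArea"]

# Fit on the training data only.
model = DecisionTreeRegressor(random_state=0)
model.fit(train[desired_factors], train["SalePrice"])

# Assumption: fill gaps in the test features with medians computed on train,
# so the test data is processed with information from training only.
test_data = test[desired_factors].fillna(train[desired_factors].median())

# Predict for all test rows and pair each prediction with its Id.
predictions = model.predict(test_data)
result = pd.DataFrame({"Id": test["Id"], "SalePrice": predictions})
print(result)
```

Whether a fill is needed depends on your data and scikit-learn version; the point is that any such step should be derived from the training set and then applied to both sets in the same way.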
Upvotes: 4
Reputation: 2553
You are already using the trained model for prediction (model.predict(prdata.head())). If you want to use that model to predict on other test data, simply supply that data instead of prdata.head(). For example, you can use the model to predict all samples from prdata by removing .head(), which restricts the DataFrame to the first 5 rows (but you just used this data to train the model; it's just an example).
Keep in mind, you still need a trained model to make predictions. Typically, you'll train a model and then present it with test data. Changing all of the references of train to test will not work, because you will not have a model for making predictions based on your test data unless you've saved it from training and restored it prior to presenting it with test data.
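If training and prediction happen in separate scripts or sessions, one way to save and restore the model is with joblib, which is installed alongside scikit-learn. This is only a sketch: the training frame is made up, and the file name price_model.joblib and the temporary directory are hypothetical choices for illustration:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up training data (illustrative values only).
train = pd.DataFrame({
    "OverallQual": [5, 7, 8, 6],
    "GrLivArea": [1200, 1800, 2400, 1500],
    "SalePrice": [120000, 200000, 310000, 160000],
})
features = ["OverallQual", "GrLivArea"]

# Train once.
model = DecisionTreeRegressor(random_state=0)
model.fit(train[features], train["SalePrice"])

# Save the fitted model to disk (hypothetical path in a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), "price_model.joblib")
joblib.dump(model, model_path)

# Later, possibly in another script: restore it and predict on new rows.
restored = joblib.load(model_path)
new_rows = pd.DataFrame({"OverallQual": [6], "GrLivArea": [1500]})
restored_pred = restored.predict(new_rows)
original_pred = model.predict(new_rows)
print(restored_pred, original_pred)
```

If everything runs in one script, none of this is needed: just keep the fitted model variable around and call predict on the test features.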
In your code, you are actually using your test.csv data file to train your model, since that is the data you pass to the model.fit method. Typically, you will not train your model with data intended for testing.
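To see why the separation matters, you can hold out part of your labelled training data and compare errors on the rows the model has seen against the rows it has not. The sketch below uses randomly generated data in place of the real files, so the column relationships and noise level are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a labelled training file (assumed relationship).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "GrLivArea": rng.integers(600, 3000, size=n),
})
data["SalePrice"] = (
    20000 * data["OverallQual"]
    + 50 * data["GrLivArea"]
    + rng.normal(0, 5000, size=n)
)

features = ["OverallQual", "GrLivArea"]

# Keep 25% of the labelled rows aside; the model never sees them during fit.
X_train, X_val, y_train, y_val = train_test_split(
    data[features], data["SalePrice"], test_size=0.25, random_state=0
)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# Error on rows used for training versus rows held out.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
val_mae = mean_absolute_error(y_val, model.predict(X_val))
print("MAE on training rows:", train_mae)
print("MAE on held-out rows:", val_mae)
```

An unconstrained decision tree can nearly memorise its training rows, so the error on training data is usually far lower than on held-out data. That is why predictions on the data you trained with say little about how the model will do on test.csv.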
Upvotes: 1