Sohaib Asim Syed

Reputation: 97

Predict test data using model based on training data set?

I'm new to Data Science and Analysis. After going through a lot of kernels on Kaggle, I made a model that predicts the price of a property. I've tested this model using my training data, but now I want to run it on my test data. I've got a test.csv file and I want to use it. How do I do that? Here is what I previously did with my training dataset:

#loading my train dataset into python
train = pd.read_csv('/Users/sohaib/Downloads/test.csv')

#factors that will predict the price
train_pr = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']

#set my model to DecisionTree
model = DecisionTreeRegressor()

#set prediction data to factors that will predict, and set target to SalePrice
prdata = train[train_pr]
target = train.SalePrice

#fitting model with prediction data and telling it my target
model.fit(prdata, target)

model.predict(prdata.head())

What I tried next was to copy the whole code, replace "train" with "test" and "prdata" with "testprdata", and I thought it would work, but sadly no. I know I'm doing something wrong here, but I don't know what it is.

Upvotes: 6

Views: 34029

Answers (2)

TheGrimmScientist

Reputation: 2897

As long as you process the train and test data exactly the same way, that predict function will work on either data set. So you'll want to load both the train and test sets, fit on the train, and predict on either just the test or both the train and test.

Also, note that the file you're reading is the test data. Assuming your files are named properly, you are currently training on your test data, even though the variable is named train.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

#loading both the train and test datasets into python
train = pd.read_csv('/Users/sohaib/Downloads/train.csv')
test = pd.read_csv('/Users/sohaib/Downloads/test.csv')

#factors that will predict the price
desired_factors = ['OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']

#set my model to DecisionTree
model = DecisionTreeRegressor()

#select the same factor columns from both sets, and set target to SalePrice
train_data = train[desired_factors]
test_data = test[desired_factors]
target = train.SalePrice

#fit the model on the training data only
model.fit(train_data, target)

#predict on the first 5 rows of the test data
model.predict(test_data.head())
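
To make "process the train and test data exactly the same way" concrete, here is a minimal sketch that sends both frames through one helper. The small in-memory frames stand in for train.csv and test.csv, and the `prepare_features` helper and the median fill are assumptions for illustration, not something from the question. If your real test file has missing values in the chosen columns, `predict` may raise an error depending on your scikit-learn version, which is one reason to fill gaps the same way on both sets.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

desired_factors = ['OverallQual', 'GrLivArea', 'GarageCars']

# Hypothetical stand-ins for train.csv and test.csv
train = pd.DataFrame({
    'OverallQual': [5, 7, 8, 6],
    'GrLivArea': [1200, 1800, 2400, 1500],
    'GarageCars': [1, 2, 3, 2],
    'SalePrice': [120000, 200000, 310000, 160000],
})
test = pd.DataFrame({
    'OverallQual': [6, 8],
    'GrLivArea': [1400, 2300],
    'GarageCars': [float('nan'), 3],  # a gap, as a real test file may contain
})

def prepare_features(df, fill_values):
    """Select the model's columns and fill gaps identically for any frame."""
    return df[desired_factors].fillna(fill_values)

# Fill values come from the training data only, then get reused for the test data
fill_values = train[desired_factors].median()

train_data = prepare_features(train, fill_values)
test_data = prepare_features(test, fill_values)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_data, train['SalePrice'])

# One prediction per test row
predictions = model.predict(test_data)
print(predictions)
```

Because both frames pass through the same function with the same fill values, the model sees the same columns in the same order at fit time and at predict time.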

Upvotes: 4

tdube

Reputation: 2553

You are already using the trained model for prediction (model.predict(prdata.head())). If you want to use that model to predict on other test data, simply supply the other test data instead of prdata.head(). For example, you can use the model to predict all samples from prdata by removing .head(), which restricts the DataFrame to the first 5 rows (this is only an illustration, since you just used this data to train the model).

Keep in mind, you still need a model to make predictions. Typically, you'll train a model and then present it with test data. Changing all of the references of train to test will not work, because you will not have a model for making predictions based on your test data unless you've saved it from training and restored it prior to presenting it with test data.
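
If you do want to train in one script and predict in another, one way to save and restore a fitted model is Python's standard pickle module. This is only a sketch under assumptions: the toy data, the `model.pkl` filename, and the temporary directory are made up for illustration, and you should only unpickle files you trust.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for the real training features and target
train_data = pd.DataFrame({
    'OverallQual': [5, 7, 8, 6],
    'GrLivArea': [1200, 1800, 2400, 1500],
})
target = pd.Series([120000, 200000, 310000, 160000])

model = DecisionTreeRegressor(random_state=0)
model.fit(train_data, target)

new_data = pd.DataFrame({'OverallQual': [6], 'GrLivArea': [1450]})

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pkl')  # hypothetical filename

    # Save the fitted model to disk
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

    # Later (for example in another script), restore it without refitting
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

original_pred = model.predict(new_data)
restored_pred = restored.predict(new_data)
print(original_pred, restored_pred)
```

The restored model gives the same predictions as the original because the fitted tree itself is what gets stored; the training data is not needed again.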

In your code, you are actually using your test.csv data file to train your model as you pass the data to the model.fit method. Typically, you will not train your model with data intended for testing.

Upvotes: 1
