Reputation: 11
I have a simulated dataset for personal loans. It contains borrowers' financial history and their requested loans. I'm trying to build a logistic regression model to assess loan status: current (0) or default (1).
I have already split the dataset into 70% train and 30% test. My code looks like this:
/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
class purpose term grade yearsemployment homeownership incomeVerified;
model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
date
isJointApplication
loanAmount
interestRate
monthlyPayment
annualIncome
dtiRatio
lengthCreditHistory
numTotalCreditLines
numOpenCreditLines
numOpenCreditLines1Year
revolvingBalance
revolvingUtilizationRate
numDerogatoryRec
numDelinquency2Years
numChargeoff1year
numInquiries6Mon
/
selection=stepwise
details
lackfit;
score data= test out=score1;
store log_model;
run;
/*Score model*/
proc logistic inmodel=model.log;
score data=train out=score2 fitstat;
run;
proc logistic inmodel=model.log;
score data=test out=score3 fitstat;
run;
/*confusion matrix*/
proc freq data=score2;
tables f_bad_good*i_bad_good / nocol norow;
run;
proc freq data=score3;
tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions on new production data, update that data, and store it. How would I do that?
Also, I wonder if anyone could take a look at my code and see if there's anything I should improve. I'm new to SAS and statistics, so any help is much appreciated!
Upvotes: 1
Views: 1426
Reputation: 12909
You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need: the saved model and the new data to score.
It looks like you are also storing your model with the store statement, which writes a binary item store that can be processed with proc plm, but you do not need to do it this way since you've already saved your model with the outmodel= option in proc logistic. The store statement is just another way to save the model if you'd like to use it that way, but I would stick with outmodel= since it's a little more straightforward. Let's look at a really simple example using sashelp.class:
data train prod;
set sashelp.class;
if(_N_ LE 15) then output train;
else output prod;
run;
proc logistic data=train outmodel=sasuser.logmodel;
model sex = age height weight;
run;
We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:
proc logistic inmodel=sasuser.logmodel;
score data=prod out=predictions;
run;
Assume prod is your new production data coming in.
Let's take a look at the predictions output dataset:
Name     Sex  Age  Height  Weight  F_Sex  I_Sex  P_F           P_M
Robert   M    12   64.8    128     M      M      0.0023352346  0.9976647654
Ronald   M    15   67      133     M      M      0.1822442826  0.8177557174
Thomas   M    11   57.5    85      M      M      0.148103678   0.851896322
William  M    15   66.5    112     M      F      0.7322326277  0.2677673723
The column I_Sex (which stands for Into) is the prediction. The other columns starting with P are the predicted probabilities of female and male, and the column starting with F (which stands for From) is the actual value. In reality, you will not have this actual value, since the outcome is unknown in production data.
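For comparison, here is a minimal sketch of the store route mentioned earlier, scoring with proc plm instead of inmodel=. The names sasuser.logstore, plm_predictions and p_hat are made up for this illustration. The ilink option asks for the prediction on the probability scale rather than the logit scale, and the probability refers to whichever level proc logistic modeled as the event.

```sas
/* Same split as in the example above */
data train prod;
    set sashelp.class;
    if (_N_ LE 15) then output train;
    else output prod;
run;

/* Fit the model and save it as an item store (hypothetical name) */
proc logistic data=train;
    model sex = age height weight;
    store sasuser.logstore;
run;

/* Restore the item store and score the new data;
   p_hat holds the predicted event probability because of ilink */
proc plm restore=sasuser.logstore;
    score data=prod out=plm_predictions predicted=p_hat / ilink;
run;
```

Unlike the score statement in proc logistic, this gives you a probability column only, so you would still need to apply your own cutoff to turn it into a predicted class.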
It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.
/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());
/* Add a timestamp and clean up the dataset */
data predictions;
set predictions;
prediction_ts = &now;
format prediction_ts datetime.;
keep name age height weight i_sex prediction_ts;
rename i_sex = predicted_sex;
run;
/* Append to the master dataset if it exists */
%if(%sysfunc(exist(master_dataset) ) ) %then %do;
proc append base=master_dataset data=predictions force;
run;
%end;
/* Otherwise, create it */
%else %do;
data master_dataset;
set predictions;
run;
%end;
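As a side note, proc append creates the base= dataset when it does not exist yet, so the existence check can probably be dropped if you prefer less macro code. A shorter sketch, assuming the same predictions and master_dataset names as above:

```sas
/* Appends to master_dataset, or creates it on the first run */
proc append base=master_dataset data=predictions force;
run;
```

The explicit %if version is still useful if you want to do something different on the first run, such as setting indexes or column lengths on the master dataset.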
You can then pull the most recent prediction for any given primary key. For example, treating name as the key:
proc sql;
select *
from master_dataset
group by name /* one group per primary key */
having prediction_ts = max(prediction_ts)
;
quit;
You could also have a separate process that records actual values so you can see how the predictions compare to reality. This extends beyond the scope of what you're asking, but it's a fantastic question and very important for productionalizing a model.
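A rough sketch of what that comparison could look like, assuming master_dataset exists as built above. The actuals dataset and its actual_sex column are hypothetical stand-ins for wherever the real outcomes end up; here they are faked from sashelp.class just so the example runs.

```sas
/* Hypothetical table of observed outcomes, one row per name */
data actuals;
    set sashelp.class(keep=name sex rename=(sex=actual_sex));
run;

/* Line up every stored prediction with the observed outcome */
proc sql;
    create table compare as
    select m.name, m.predicted_sex, a.actual_sex, m.prediction_ts
    from master_dataset as m
    inner join actuals as a
        on m.name = a.name;
quit;

/* Confusion matrix of actual vs. predicted */
proc freq data=compare;
    tables actual_sex*predicted_sex / nocol norow;
run;
```

This compares every stored prediction, not only the latest one per key, so you may want to filter on prediction_ts first.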
Upvotes: 1