Zhu Juliette

Reputation: 11

(SAS) How to make predictions on new data using a trained logistic regression model?

I have a simulated dataset for personal loans. It contains borrowers' financial history and their requested loans. I'm trying to fit a logistic regression model to assess loan status: current (0) or default (1).

I have already split the dataset into 70% train and 30% test. My code looks like this:

/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
    class purpose term grade yearsemployment homeownership incomeVerified;
    model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
                    date
                    isJointApplication
                    loanAmount
                    interestRate
                    monthlyPayment
                    annualIncome
                    dtiRatio
                    lengthCreditHistory
                    numTotalCreditLines
                    numOpenCreditLines
                    numOpenCreditLines1Year
                    revolvingBalance
                    revolvingUtilizationRate
                    numDerogatoryRec
                    numDelinquency2Years
                    numChargeoff1year
                    numInquiries6Mon                    
/
    selection=stepwise
    details
    lackfit;
    score data= test out=score1;
    store log_model;
run;

/*Score model*/
proc logistic inmodel=model.log;
    score data=train out=score2 fitstat;
run;

proc logistic inmodel=model.log;
    score data=test out=score3 fitstat;
run;

/*confusion matrix*/
proc freq data=score2;
    tables f_bad_good*i_bad_good / nocol norow; 
run;

proc freq data=score3;
    tables f_bad_good*i_bad_good / nocol norow; 
run;

My next step is to use this trained model to make predictions on new production data, update that data, and store it. How would I do that?

Also, I wonder if anyone could take a look at my code and see if there's anything I should improve. I'm new to SAS and statistics, so any help is much appreciated!

Upvotes: 1

Views: 1426

Answers (1)

Stu Sztukowski

Reputation: 12909

You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need:

  • An input dataset
  • A model to apply to the data

It looks like you are also storing your model with the store statement, which writes a binary item store that can be processed with proc plm, but you do not need to do it this way since you've already saved your model with the outmodel= option in proc logistic. The store statement is just another way to save the model if you'd like to use it that way, but I would stick with outmodel= since it's a little more straightforward. Let's look at a really simple example using sashelp.class:

data train
     prod;

    set sashelp.class;

    if(_N_ LE 15) then output train;
        else output prod;
run;

proc logistic data=train outmodel=sasuser.logmodel;
    model sex = age height weight;
run;

We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:

proc logistic inmodel=sasuser.logmodel;
    score data=prod out=predictions;
run;

Assume prod is your new production data coming in.
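For comparison, the store statement route mentioned above could be sketched as follows. This is only an illustration under assumptions: the item store name work.class_store, the output dataset predictions_plm, and the variable name p_event are made up for this example. Note that proc plm returns a predicted probability for the modeled event rather than an I_ classification column.

```sas
/* Fit the same toy model, but save it as an item store (hypothetical name) */
proc logistic data=train;
    model sex = age height weight;
    store work.class_store;
run;

/* Score new data from the item store.
   ILINK converts the linear predictor to the probability scale;
   p_event is a hypothetical name for the predicted probability. */
proc plm restore=work.class_store;
    score data=prod out=predictions_plm predicted=p_event / ilink;
run;
```

With this route you would have to derive the predicted class yourself from p_event, which is one reason the inmodel= approach is simpler here.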

Let's take a look at the predictions output dataset:

Name    Sex Age Height  Weight  F_Sex   I_Sex   P_F             P_M
Robert  M   12  64.8    128     M       M       0.0023352346    0.9976647654
Ronald  M   15  67      133     M       M       0.1822442826    0.8177557174
Thomas  M   11  57.5    85      M       M       0.148103678     0.851896322
William M   15  66.5    112     M       F       0.7322326277    0.2677673723

The column I_Sex (which stands for Into) is the prediction. The columns starting with P_ are the predicted probabilities of each level (P_F for female, P_M for male), and the column starting with F_ (which stands for From) is the actual value. In reality, you will not have this actual value, since the outcome for production data is still unknown.
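If the default classification in I_Sex is not what you want, one option is to derive your own prediction from the probability columns. The sketch below assumes a cutoff of 0.30, which is purely illustrative; the dataset name predictions_cutoff and the variable predicted_sex_custom are also hypothetical.

```sas
/* Apply a custom probability cutoff instead of relying on I_Sex.
   The 0.30 threshold is an arbitrary example value, not a recommendation. */
data predictions_cutoff;
    set predictions;

    length predicted_sex_custom $1;

    if P_F >= 0.30 then predicted_sex_custom = 'F';
        else predicted_sex_custom = 'M';
run;
```

For a loan model, the same pattern would let you flag a borrower whenever the predicted probability of default exceeds a threshold chosen for your business case.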

It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.

/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());

/* Add a timestamp and clean up the dataset */
data predictions;
    set predictions;

    prediction_ts = &now;

    format prediction_ts datetime.;

    keep name age height weight i_sex prediction_ts;
    rename i_sex = predicted_sex;
run;

/* Append to the master dataset if it exists; otherwise, create it.
   Open-code %if/%else requires SAS 9.4M5 or later; on older releases,
   wrap this block in a macro. */
%if(%sysfunc(exist(master_dataset) ) ) %then %do;
    proc append base=master_dataset data=predictions force;
    run;
%end;
%else %do;
    data master_dataset;
        set predictions;
    run;
%end;

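You can then pull the most recent prediction for any given primary key (here, name). For example:

proc sql;
    select *
    from master_dataset
    /* Group by the key so the max timestamp is taken per name, not overall */
    group by name
    having prediction_ts = max(prediction_ts)
    ;
quit;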

You could also have a separate process that attaches actual values once they are known, to see how the predictions compare to reality. This extends beyond the scope of what you're asking, but it is a fantastic question and very important when putting a model into production.
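As a rough sketch of that comparison step, assume the true outcomes eventually arrive in a dataset called actuals with the columns name and actual_sex. Both the dataset and those column names are hypothetical and not part of the example above.

```sas
/* Join stored predictions to the later-arriving actual values
   (actuals and actual_sex are assumed names for illustration) */
proc sql;
    create table prediction_vs_actual as
    select m.name,
           m.prediction_ts,
           m.predicted_sex,
           a.actual_sex
    from master_dataset as m
    inner join actuals as a
        on m.name = a.name
    ;
quit;

/* Confusion matrix of actual vs. predicted values */
proc freq data=prediction_vs_actual;
    tables actual_sex*predicted_sex / nocol norow;
run;
```

Running this on a schedule would give you a simple way to watch for the model's accuracy drifting over time.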

Upvotes: 1
