Reputation: 159
I am working on a project that implements a two-step procedure for predicting whether an individual will pay back their loan. The project is designed to teach us about retail credit risk and everyday individual borrowing, e.g. credit cards.
The two steps are as follows:
1. Run a multivariate logistic regression on "resolved" cases. That is, these observations have a clear result in their dependent variable: 1 for "cure" and 0 for "liquidated". For this section, I am using factors
2. Now that I have a model which has been fed past information on whether individuals managed to clear their debt or were declared insolvent, I apply this model to individuals who are currently unable to pay back their loan.
The premise of this is to supplement the closed cases with open cases. The open cases will therefore have a probability of "paying back" or "cure" attached to them.
Now my input table looks like this:
Resolution_status  dependent_var  weight  X1  X2    X3  X4  X5
Resolved           1              1       30  1500  3   3
Resolved           0              1       15  750   1   1
----------------------------------------------------------------
Unresolved         1              0.6     5   500   6   6
Unresolved         0              0.4     5   500   6   6
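I have separated out the unresolved cases to show that each of them follows these rules:
- Each unresolved observation is duplicated
- The first copy is given a '1' for cure and a weight equal to the probability of cure estimated by the model in step 1
- The second copy is given a '0' and the complementary weight (one minus that probability), as in the 0.6 / 0.4 rows of the table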
What is the impact of using the weight statement? Should I be using a zero-one-inflated beta regression or a fractional logit model instead?
I have reproduced the above example with the sashelp.baseball dataset so that you can run it:
/*Split the dataset into resolved and unresolved*/
DATA baseball_resolved
     baseball_unresolved;
    SET sashelp.baseball
        (KEEP = cr: logsalary);
    IF NOT MISSING(logsalary) THEN DO;
        IF logsalary > 6.5 THEN flag = 1;
        ELSE flag = 0;
    END;
    IF NOT MISSING(logsalary) THEN OUTPUT baseball_resolved;
    ELSE OUTPUT baseball_unresolved;
    DROP logSalary;
RUN;
/*Predict the model on the resolved cases*/
PROC LOGISTIC DESCENDING
    OUTMODEL = in_model_baseball
    DATA = baseball_resolved
    PLOTS(ONLY) = NONE;
    MODEL flag (Event = '1') = cr:
        /
        SELECTION = NONE
        LINK = LOGIT
    ;
RUN;
QUIT;
/*Apply the model to the unresolved cases*/
PROC LOGISTIC
    INMODEL = in_model_baseball;
    SCORE DATA = baseball_unresolved
          OUT = unresolved_score
                (KEEP = cr: p_1 p_0);
RUN;
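/*Now output duplicate rows, with a weight attached*/
/*Copy 1: flag = 1 (cure), weighted by the predicted probability of cure*/
DATA unresolved_baseball_p_cure;
    SET unresolved_score
        (RENAME = (p_1 = weight));
    flag = 1;
    DROP p_0;
RUN;
/*Copy 2: flag = 0 (non-cure), weighted by the predicted probability of non-cure*/
DATA unresolved_baseball_p_non_cure;
    SET unresolved_score
        (RENAME = (p_0 = weight));
    flag = 0;
    DROP p_1;
RUN;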
/*Attach a weight of 1 to all resolved cases*/
DATA baseball_resolved_weight;
    SET baseball_resolved;
    WEIGHT = 1;
RUN;
/*Merge the tables*/
DATA full_table
     (rename = (weight = weight_var));
    SET
        baseball_resolved_weight
        unresolved_baseball_p_cure
        unresolved_baseball_p_non_cure;
RUN;
/*Run a logistic regression with weight*/
proc logistic
    data = full_table;
    model flag (EVENT = '1') = cr:;
    weight weight_var;
RUN;
Does the weight statement work in the way I am trying to use it? My objective is essentially to run a logistic regression on 1's and 0's, but with each "unresolved" case entered as a pair of duplicate rows weighted by its "probability of cure".
Upvotes: 0
Views: 2445
Reputation: 71
A weight statement applies the weight to an entire row, that is, to both the independent variables and the dependent variable.
For example, if you have only these four rows in a dataset,
Resolution_status  dependent_var  weight  X1  X2    X3  X4  X5
Resolved           1              1       30  1500  3   3
Resolved           0              1       15  750   1   1
Unresolved         1              0.6     5   500   6   6
Unresolved         0              0.4     5   500   6   6
The way to look at this is: though you actually have 4 rows, for computational purposes this dataset is treated as having only 3 rows (Sigma(weight) = 1 + 1 + 0.6 + 0.4 = 3).
So, when you run PROC LOGISTIC with 'weight' as the weight variable on the above 4-observation dataset, you are technically fitting a logistic regression on:
3 observations, of which 1.6 have dependent_var = 1 and 1.4 have dependent_var = 0.
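The counts above can be checked by hand. Here is a minimal sketch in plain Python (not SAS), using only the dependent_var and weight columns of the four example rows. The fitted probabilities in it are made up for illustration and are not output from any model; the point is only the arithmetic, namely that each row enters the log-likelihood multiplied by its weight, so a duplicated pair with weights 0.6 and 0.4 contributes the same amount as a single row with a fractional outcome of 0.6. It says nothing about how PROC LOGISTIC reports standard errors.

```python
import math

# (dependent_var, weight) for the four example rows.
rows = [(1, 1.0), (0, 1.0), (1, 0.6), (0, 0.4)]

weighted_events = sum(w for y, w in rows if y == 1)      # 1 + 0.6
weighted_nonevents = sum(w for y, w in rows if y == 0)   # 1 + 0.4
total_weight = weighted_events + weighted_nonevents


def weighted_loglik(rows, probs):
    """Sum of w * [y*log(p) + (1 - y)*log(1 - p)] over the rows."""
    return sum(
        w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        for (y, w), p in zip(rows, probs)
    )


# Hypothetical fitted probabilities of cure, made up for illustration.
# The two unresolved rows share the same X values, so they share one p.
p_unresolved = 0.55
probs = [0.8, 0.3, p_unresolved, p_unresolved]

# Contribution of the duplicated pair alone ...
pair = weighted_loglik(rows[2:], probs[2:])
# ... compared with one row whose outcome is the fraction 0.6.
fractional = 0.6 * math.log(p_unresolved) + 0.4 * math.log(1 - p_unresolved)

print(weighted_events, weighted_nonevents, total_weight)
print(math.isclose(pair, fractional))
```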
The weights also apply to the independent variables (X1 - X5). For example, if you want to compute the mean of X1, it is no longer (30+15+5+5)/4; instead, it is (30*1 + 15*1 + 5*0.6 + 5*0.4)/3.
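To make the X1 calculation concrete, here is a small sketch in plain Python (not SAS) using the X1 and weight values from the four rows. The variable names are only for illustration.

```python
# X1 values and weights from the four example rows.
x1 = [30, 15, 5, 5]
weights = [1.0, 1.0, 0.6, 0.4]

# Ordinary mean: every row counts once.
unweighted_mean = sum(x1) / len(x1)

# Weighted mean: each value is scaled by its weight and the divisor
# is the sum of the weights (3), not the number of rows (4).
weighted_sum = sum(w * x for w, x in zip(weights, x1))
weighted_mean = weighted_sum / sum(weights)

print(unweighted_mean)
print(round(weighted_mean, 2))
```

The two unresolved rows together count as a single observation with X1 = 5, which is why the weighted mean sits above the unweighted one.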
That is what the weight does from a technical standpoint. I will refrain from commenting on your premise or on the efficacy of such an approach, as that depends on a fuller understanding of your case and on your comfort level with the assumptions made from a credit risk standpoint.
Hope this helps.
Upvotes: 1