kishore

Reputation: 541

How would I create an index to be used in regression?

I have 2 continuous variables, each having values in the range [0, 1]. Each can be categorized as Low ($\le 0.25$), Medium ($0.25 - 0.70$) and High ($\ge 0.7$). I need to create an index using both the variables and use this index in a regression model. The generated index will be as per following truth table:

Var1/ Var2    | Low | Medium | High   |
=======================================
Low           | Low | Low    | Low    |
Medium        | Low | Medium | Medium |
High          | Low | Medium | High   |
=======================================

Straightforward multiplication of the two variables is not the solution, as some High/High pairs yield a Medium output (for example, var1 = 0.75 and var2 = 0.8 give 0.6).

In the model, I would like to use the index expression (rather than the categorical transformation). This will preserve the data variation.

What f(var1, var2) will provide me this index to be used in lm/R?

Help!!!

Upvotes: 0

Views: 4011

Answers (5)

LyzandeR

Reputation: 37879

In my view, since you want to use this new index in a regression, you are trying to do what is known as feature elimination. Generally, it is best to use all the variables you have if their total number is small. If the number of variables is large and you therefore need to eliminate some, there are multiple ways to do it, including stepwise elimination and recursive feature elimination.

In your case you only have 2 variables and essentially you want to combine them without losing any variance. In my view, one thing you can use is Principal Component Analysis. Let's see an example:

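#create data
var1 <- runif(100)
var2 <- runif(100)
df <- data.frame(var1,var2)

#the line below fits a PCA model
#note: without a leading ~, var1+var2 is evaluated as a numeric vector, so
#princomp sees a single column (the sum); princomp(~ var1+var2, data=df)
#would be the formula form that uses both columns
PCAmod <- princomp(var1+var2,data=df)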

> summary(PCAmod)
Importance of components:
                          Comp.1
Standard deviation     0.4052599
Proportion of Variance 1.0000000
Cumulative Proportion  1.0000000

The above shows that one principal component has been created, i.e. a vector of 100 new elements. Because the call passed the single column var1+var2, this component accounts for 100% of the variance of that sum (Proportion of Variance in the table above), not of var1 and var2 separately.

newvar <- PCAmod$scores #the new vector

Essentially, the newvar can be used instead of var1 and var2
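For comparison, here is a minimal sketch of the formula interface, assuming the intent was a PCA on both columns. The names `pca2`, `prop_var` and `index_pc1` and the seed are illustrative, not from the answer above. With a one-sided formula, princomp returns one component per input variable, so the first score column would be the candidate single index.

```r
# Sketch only: simulated data, illustrative object names
set.seed(1)
df <- data.frame(var1 = runif(100), var2 = runif(100))

# One-sided formula: both columns enter the PCA
pca2 <- princomp(~ var1 + var2, data = df)

# Share of the total variance carried by each component
prop_var <- pca2$sdev^2 / sum(pca2$sdev^2)
print(round(prop_var, 3))

# First component as a candidate single index
index_pc1 <- pca2$scores[, 1]
```

With two roughly independent uniform inputs the first component will usually carry well under 100% of the variance, so some information is lost when only `index_pc1` is kept.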

If you need the vector to be numbers ranging between [0,1] then you can scale it:

scaled_newvar <- scale(newvar,center=min(newvar), scale=max(newvar)-min(newvar) )

> summary(scaled_newvar)
     Comp.1      
 Min.   :0.0000  
 1st Qu.:0.2991  
 Median :0.4607  
 Mean   :0.4788  
 3rd Qu.:0.6566  
 Max.   :1.0000  

However, the above will probably not conform to your 'low','medium','high' condition table, but I think this is the right thing to do if you will use the result in a regression.

If the above is not satisfying enough (and I wouldn't recommend it), then:

  1. Just use the element-wise minimum, pmin(var1,var2), for each combination.
  2. Multiply the two, applying the boundary value if the product falls outside the range you want, e.g. if both var1 and var2 are high and their product is medium, then choose 0.75 as the correct value.
  3. According to your final edit, you could just multiply the 2 together without caring about 'low','medium','high'.
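Options 1 and 2 can be sketched as follows. This is an illustration on simulated data, not a recommended model. The thresholds 0.25 and 0.7 come from the question; the names `idx_min`, `lower` and `idx_prod`, and the choice to raise the product to the lower bound of the target category, are assumptions made for the example.

```r
# Sketch only: simulated inputs in [0, 1]
set.seed(123)
var1 <- runif(100)
var2 <- runif(100)

# Option 1: element-wise minimum (pmin, not min, which returns one number)
idx_min <- pmin(var1, var2)

# Option 2: product, raised to the lower bound of the category that the
# truth table assigns (the category of the minimum). For inputs in [0, 1]
# the product never exceeds the minimum, so only a lower clamp is needed.
lower <- c(0, 0.25, 0.7)[findInterval(idx_min, c(0, 0.25, 0.7))]
idx_prod <- pmax(var1 * var2, lower)

# The pair from the question: 0.75 * 0.8 = 0.6 is lifted to 0.7
pmax(0.75 * 0.8, 0.7)
```

The clamped product keeps the category of the minimum but piles values up on the boundaries, which is one reason to prefer the plain minimum if a continuous index is wanted.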

Upvotes: 0

IRTFM

Reputation: 263331

After re-reading your request, my (second) guess is that you want only the "numerical index" and could dispense with the character vector label. If it is entered as a numerical variable in a regression formula, the p-value for that synthetic interaction would give you a "test of trend" for the joint "minimum" discretized level condition.

inter.n <-  pmin( findInterval(x, c(0, .25, .7, 1)), 
                  findInterval(y, c(0, .25, .7, 1)) )
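As a rough sketch of how such a numeric index might be entered in lm: the response `resp` below is simulated purely for illustration, and `idx`, `breaks` and `fit` are hypothetical names. `rightmost.closed = TRUE` is added here as an assumption so that a value of exactly 1 stays in the third interval.

```r
# Sketch only: simulated predictors and a made-up response
set.seed(1)
var1 <- runif(200)
var2 <- runif(200)
breaks <- c(0, 0.25, 0.7, 1)

# Joint "minimum" level: 1 = low, 2 = medium, 3 = high
idx <- pmin(findInterval(var1, breaks, rightmost.closed = TRUE),
            findInterval(var2, breaks, rightmost.closed = TRUE))

# Hypothetical response that rises with the index
resp <- 1 + 0.5 * idx + rnorm(200, sd = 0.3)

# Entering idx as numeric gives a single slope, i.e. a test of linear trend
fit <- lm(resp ~ idx)
print(coef(summary(fit)))
```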

Earlier comments: At the moment it is unclear how you want the inequalities to work when values fall on the boundaries. The findInterval function can handle intervals closed on either the left (the default) or the right (left.open = TRUE). You say: "Low ($\le 0.25$), Medium ($0.25 - 0.70$) and High ($\ge 0.7$)", which would make a value of either 0.25 or 0.7 a member of two groups. The code is fairly simple if the groups are Low ($\lt 0.25$), Medium ($\ge 0.25$ and $\lt 0.70$) and High ($\ge 0.7$):

x <- runif(1000)
y <- runif(1000)
inter <- c("Low", "Middle", "High")[ pmin( findInterval(x, c(0, .25, .7, 1)), 
                                          findInterval(y, c(0, .25, .7, 1)) ) ]
> table(inter)
inter
  High    Low Middle 
    78    383    539 
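The boundary behaviour discussed above can be checked directly. This small sketch uses illustrative names (`breaks`, `edge` and the result variables); `left.open` is assumed to be available, which requires R 3.3.0 or later.

```r
breaks <- c(0, 0.25, 0.7, 1)
edge <- c(0.25, 0.7)

# Default: intervals are [a, b), so a boundary value joins the upper group
default_side <- findInterval(edge, breaks)
print(default_side)   # 2 3

# left.open = TRUE: intervals are (a, b], so it joins the lower group
left_open_side <- findInterval(edge, breaks, left.open = TRUE)
print(left_open_side) # 1 2

# A value of exactly 1 falls past the last break unless rightmost.closed = TRUE
top_default <- findInterval(1, breaks)
top_closed  <- findInterval(1, breaks, rightmost.closed = TRUE)
print(c(top_default, top_closed)) # 4 3
```

An index of 4 would return NA when used to subscript a three-element label vector, so `rightmost.closed = TRUE` is worth considering if the data can contain exact 1s.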

If you use a modification of @BenBolker's cfun that makes ordered factors, you can get pmin to work directly on the values:

cfun2 <- function(x) cut(x,c(0, 0.25, 0.7, 1.01), include.lowest=TRUE, 
               labels=c("low","medium","high"), ordered_result=TRUE)
inter.f <- pmin( cfun2(x) , cfun2(y) )

 table(inter.f)
#--------
inter.f
   low medium   high 
   449    473     78 

And that is in some ways superior because the table function automagically honors the ordering of the factor labels.

Upvotes: 1

Ben Bolker

Reputation: 226087

How about:

cfun <- function(x) cut(x,c(-0.01,0.25,0.7,1.01),
              labels=c("low","medium","high"))
var1c <- cfun(var1)
var2c <- cfun(var2)
comb <- ifelse(var1c=="low" | var2c=="low", "low",
           ifelse(var1c=="medium" | var2c=="medium", "medium",
                "high"))

or actually, as suggested by other answers:

cfun(pmin(var1,var2))  # pmin is element-wise; min() would return a single value
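A quick check, sketched on simulated data, that the nested ifelse and the category of the element-wise minimum give the same labels. The data, the seed and the name `comb_min` are illustrative. Note that the shortcut needs pmin (element-wise) rather than min, which collapses both vectors to one number.

```r
# Sketch only: simulated inputs
set.seed(42)
var1 <- runif(100)
var2 <- runif(100)

cfun <- function(x) cut(x, c(-0.01, 0.25, 0.7, 1.01),
                        labels = c("low", "medium", "high"))
var1c <- cfun(var1)
var2c <- cfun(var2)

# Truth-table version
comb <- ifelse(var1c == "low" | var2c == "low", "low",
          ifelse(var1c == "medium" | var2c == "medium", "medium",
                 "high"))

# Shortcut: categorise the element-wise minimum
comb_min <- as.character(cfun(pmin(var1, var2)))

print(identical(comb, comb_min))
print(length(cfun(min(var1, var2))))  # min() gives just one label
```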

Upvotes: 1

Michiel uit het Broek

Reputation: 993

I do not know whether there is a built-in function for this; I couldn't find one right away. Can you use something like the following?

get_index <- function(var1, var2)
{
    if (var1 < 0 || var1 > 1 || var2 < 0 || var2 > 1)
        return("out of range");

    low <- min(var1, var2);
    if (low < 0.25)
        return("Low");
    if (low <= 0.70)
        return("Medium");

    return("High");
}
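The function above works on one pair of values at a time, since `if` and `||` expect single values. One way to apply it to whole columns, sketched here with made-up inputs and the hypothetical name `labels_out`, is mapply:

```r
# get_index as defined above, repeated so the sketch is self-contained
get_index <- function(var1, var2) {
    if (var1 < 0 || var1 > 1 || var2 < 0 || var2 > 1)
        return("out of range")

    low <- min(var1, var2)
    if (low < 0.25)
        return("Low")
    if (low <= 0.70)
        return("Medium")

    return("High")
}

# Illustrative inputs, one pair per position
v1 <- c(0.1, 0.5, 0.8, 1.2)
v2 <- c(0.9, 0.9, 0.9, 0.5)

# mapply calls get_index once per pair and collects the labels
labels_out <- unname(mapply(get_index, v1, v2))
print(labels_out)
```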

Upvotes: 1

AndreaG

Reputation: 13

I am a beginner with the R language and its syntax, but it seems you are looking for a function rather than a procedure.

What about using f(var1, var2)=min(var1,var2)? Clearly, you have to apply this to the numeric version, and then categorize the variables.
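A minimal sketch of that idea in R, assuming the continuous index is the element-wise minimum and that boundary values belong to the upper group (Low < 0.25, Medium from 0.25 up to but not including 0.7, High >= 0.7). The response `resp`, the helper `categorise` and the other names are hypothetical and only serve the example.

```r
# Sketch only: simulated predictors and a made-up response
set.seed(7)
n <- 200
var1 <- runif(n)
var2 <- runif(n)

# Continuous index: keeps the variation and respects the truth table
index <- pmin(var1, var2)

# Hypothetical response for illustration
resp <- 2 + 3 * index + rnorm(n, sd = 0.2)
fit <- lm(resp ~ index)
print(coef(fit))

# Categorise afterwards, only if labels are needed
categorise <- function(x) cut(x, breaks = c(0, 0.25, 0.7, 1),
                              labels = c("Low", "Medium", "High"),
                              right = FALSE, include.lowest = TRUE)
print(table(categorise(index)))

# The pair from the question stays High, unlike the product
print(categorise(pmin(0.75, 0.8)))
```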

Upvotes: 0
