doctorate

Reputation: 1413

Sorting features based on their importance in CARET package

The caret help page for varImp() describes the PLS importance measure as follows:

Partial Least Squares: the variable importance measure here is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components and are computed separately for each outcome. Therefore, the contribution of the coefficients are weighted proportionally to the reduction in the sums of squares.

Below is the variable importance output for a classification model fitted with caret using method="pls":

> varImp(plsFitvac)
   pls variable importance

  variables are sorted by average importance across the classes
             H       P      R     Q
IL17A    9.516 100.000 19.813 61.20
IL8     17.814   1.344 80.628 34.33
IL6ST   10.319  75.452 62.296 68.41
IL23A    7.662  55.422 43.188 44.17
IL27RA  10.311   0.000 45.932 24.76
IL12RB2 15.497  28.467 38.848 33.73
IL12B   13.569  22.799 32.728 27.25
IL12RB1 12.292  23.431  6.395 18.67
IL12A   10.394  22.774 12.330 18.94
EBI3    12.039   6.932 14.877 11.01
IL23R   13.053  10.018  9.708 13.22  

That's fine, but when I extract this data frame by this line of code:

df <- varImp(plsFitvac)$importance  

I get the same values as above, but unsorted; it would be nice if they came back sorted. To sort this data frame by the average importance across classes (as stated in the output), I did this:

df$Sort <- apply(df, 1, sum)
df$Sort <- df$Sort/ncol(df) # not needed since sum and average will be sorted alike
df[order(df$Sort,decreasing=TRUE),]

> df[order(df$Sort,decreasing=TRUE),]
                H          P         R        Q      Sort
IL6ST   10.318521  75.451572 62.295779 68.40740 43.294655
IL17A    9.515726 100.000000 19.813439 61.20098 38.106029
IL23A    7.662351  55.422249 43.187811 44.16892 30.088267
IL8     17.813522   1.343589 80.628315 34.32519 26.822122
IL12RB2 15.497069  28.466890 38.847943 33.73476 23.309331
IL12B   13.569266  22.798682 32.727759 27.24567 19.268275
IL27RA  10.311489   0.000000 45.932101 24.76301 16.201321
IL12A   10.393673  22.773860 12.329890 18.94323 12.888131
IL12RB1 12.291526  23.431046  6.395495 18.66685 12.156983
IL23R   13.053380  10.018339  9.708473 13.22094  9.200227
EBI3    12.039321   6.931682 14.877214 11.00619  8.970881  

This gives a different ordering from the sorted list that caret prints via varImp(). Am I missing something here? Thanks.
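One detail in the averaging code above: by the time ncol(df) is evaluated, df already contains the Sort column, so the sum is divided by 5 rather than 4. The ranking is unaffected, but the Sort values are not true class averages. A minimal sketch that averages only the class columns, using three rows rounded from the dput() output further down, might look like this (the names imp, class_cols and imp_by_mean are illustrative and not part of caret):

```r
# Three rows of the importance table, rounded from the dput() output
imp <- data.frame(
  H = c(17.814, 9.516, 10.319),
  P = c(1.344, 100.000, 75.452),
  R = c(80.628, 19.813, 62.296),
  Q = c(34.325, 61.201, 68.407),
  row.names = c("IL8", "IL17A", "IL6ST")
)

# Record the class columns before adding any helper column
class_cols <- names(imp)

# Average over the four classes only
imp$mean <- rowMeans(imp[, class_cols])

# Sort by average importance, largest first
imp_by_mean <- imp[order(imp$mean, decreasing = TRUE), ]
print(imp_by_mean)
```

Capturing the column names first keeps the helper column out of the average, whatever else is added to the data frame later.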

Note:
I didn't pass importance = TRUE argument to train() call for a PLSDA model, i.e., method = "pls".

$importance

> dput(df)
structure(list(H = c(17.8135216215421, 9.51572613703257, 7.66235106434041, 
13.0533801732928, 12.0393206867905, 10.3185210244416, 10.3936725783446, 
15.4970686175322, 13.569265567599, 12.291526066084, 10.3114887728613
), P = c(1.34358921525031, 100, 55.4222485106407, 10.0183388053119, 
6.93168239216908, 75.4515720604057, 22.7738599760963, 28.4668895810321, 
22.7986823025468, 23.4310464801875, 0), R = c(80.6283150180913, 
19.8134392303359, 43.1878112878907, 9.70847280019312, 14.8772141493434, 
62.2957787591232, 12.3298895434334, 38.8479426109151, 32.7277593254102, 
6.39549491068232, 45.932101268196), Q = c(34.3251855315416, 61.2009790458015, 
44.1689231007598, 13.2209412495112, 11.0061874803613, 68.4074013762385, 
18.9432341406872, 33.7347566350668, 27.2456691770754, 18.6668467881651, 
24.7630136095146)), .Names = c("H", "P", "R", "Q"), row.names = c("IL8", 
"IL17A", "IL23A", "IL23R", "EBI3", "IL6ST", "IL12A", "IL12RB2", 
"IL12B", "IL12RB1", "IL27RA"), class = "data.frame")  

Question:

How should importance be measured across classes? Can I trust the unsorted values returned by varImp()?

EDIT:
Ranking the variables by their maximum importance across classes with max():

vi <- varImp(plsFitvac)$importance  
vi$max <- apply(vi, 1, max)
vi[order(-vi$max),]  

gave the same ordering as varImp():

varImp(plsFitvac)  

which yielded this:

> vi[order(-vi$max),]
                H          P         R        Q       max
IL17A    9.515726 100.000000 19.813439 61.20098 100.00000
IL8     17.813522   1.343589 80.628315 34.32519  80.62832
IL6ST   10.318521  75.451572 62.295779 68.40740  75.45157
IL23A    7.662351  55.422249 43.187811 44.16892  55.42225
IL27RA  10.311489   0.000000 45.932101 24.76301  45.93210
IL12RB2 15.497069  28.466890 38.847943 33.73476  38.84794
IL12B   13.569266  22.798682 32.727759 27.24567  32.72776
IL12RB1 12.291526  23.431046  6.395495 18.66685  23.43105
IL12A   10.393673  22.773860 12.329890 18.94323  22.77386
EBI3    12.039321   6.931682 14.877214 11.00619  14.87721
IL23R   13.053380  10.018339  9.708473 13.22094  13.22094  

but using sum() of importance across classes yielded a different ranking (see above). So which one is correct and what happens in case of ties in the max() method?

Upvotes: 0

Views: 5527

Answers (2)

onur öztornacı

Reputation: 13

Try write.csv2(vi, "vi.csv"), where vi is the importance data frame, and then sort it in Excel.

Upvotes: 0

topepo

Reputation: 14316

The output shown using varImp(plsFitvac) is formatted and shown to some abbreviated level of precision:

> format(9.515726, digits = 4)
[1] "9.516"

Try using various values of digits in this code:

format(varImp(plsFitvac)$importance, digits = 4)

and you should be able to see that they are the same values.

When you print the data frame, print.data.frame uses digits = getOption("digits") while print.varImp.train uses max(3, getOption("digits") - 3).

The default value of getOption("digits") gives me a headache, which is why my function is the way that it is.
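As a rough illustration of the two precision settings described above, the sketch below formats one value from the table both ways. The variable names are made up for the example, and it only mimics the digits arithmetic; it does not call either print method.

```r
x <- 9.515726

# Precision used when printing a plain data frame
df_digits <- getOption("digits")

# Precision described above for print.varImp.train
vi_digits <- max(3, getOption("digits") - 3)

full_text  <- format(x, digits = df_digits)
short_text <- format(x, digits = vi_digits)

print(c(data.frame = full_text, varImp = short_text))
```

With the default setting of 7 significant digits, the second value should be the "9.516" shown in the printed varImp() table, while the first keeps the longer form seen in the extracted data frame.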

EDIT: if the question is about the ordering, the way the function ranks these is to find the maximum importance across the classes for each predictor and order them based on that. There is a little more to it (in case of ties etc) and the code is in the undocumented internal function sortImp. This code should approximate that function:

vi$max <- apply(vi, 1, max)
vi[order(-vi$max),]
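On ties: order() is stable, so rows with equal maxima stay in their original order unless a second key is supplied. The sketch below breaks ties on the row mean. That choice is an assumption made for illustration and is not taken from sortImp; the data are made up as well.

```r
# Made-up importance values; "a" and "b" tie on their maximum
vi_demo <- data.frame(
  H = c(10, 50, 5),
  P = c(50, 20, 30),
  row.names = c("a", "b", "c")
)

row_max  <- apply(vi_demo, 1, max)  # primary key: largest value per predictor
row_mean <- rowMeans(vi_demo)       # secondary key: assumed tie-breaker

# Sort by maximum, then by mean, both descending
ranked <- vi_demo[order(-row_max, -row_mean), ]
print(ranked)
```

Here "b" is placed ahead of "a" because its mean is higher, even though both share a maximum of 50.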

Max

Upvotes: 2
