user42719
user42719

Reputation: 141

Find Nth largest Across Columns (NOT in a vector)

Consider the following example:

Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)

DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))

In R, functions already exist to find the element-wise max and min, so I could easily create a new variable that is equal to the largest (or smallest) value across the columns of interest:

> DF$Max <- pmax(Var_A,Var_B,Var_C,Var_D)
> DF
  Var_A Var_B Var_C Var_D Max
1    44    33     6    72  72
2    29    66    51    12  66
3    35    29    47    79  79
4    39    79    47    65  79
5    97    60    36    81  97

But what if I need to create a variable that captures, say, the second largest value in each row (i.e., across the columns)?

In the real data set that I'm working with, I have 600+ columns and about 28 million records. I need to create variables that will identify and store the largest, second largest, third largest, etc. values found when looking across the variables (columns) for each record, much like pmax would do, but for other ordinals.

The only way that I have been able to functionally make it work on a subset of the data is to do a loop, but that loop won't finish in my lifetime if I run it on the entire data set. I also considered using the apply function, but my understanding is that apply will convert the data set to a matrix first, which my data set won't take kindly to.

Any suggestions on a non-loop way to do this? And with this amount of data, the faster the better...

Upvotes: 0

Views: 108

Answers (1)

maydin
maydin

Reputation: 3755

This may be a solution...

            Var_A <- sample(1:100,5,replace=TRUE)
            Var_B <- sample(1:100,5,replace=TRUE)
            Var_C <- sample(1:100,5,replace=TRUE)
            Var_D <- sample(1:100,5,replace=TRUE)

            DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))




            result <-sapply(1:nrow(DF), function(x) {

                    df <- as.data.frame(DF[x,])
                    ord <- df[order(-DF[x,])]




                    })

            result <- t(result)

            output <- cbind(DF,result)

            for (i in (ncol(DF)+1):ncol(output) )  {

            colnames(output)[i]<-paste0("Max",i-ncol(DF)) 

            }

            output

         Var_A Var_B Var_C Var_D Max1 Max2 Max3 Max4
      1    42    12    64     9   64   42   12    9
      2    67    22    47     4   67   47   22    4
      3    80    56    82    94   94   82   80   56
      4    31    62    88    73   88   73   62   31
      5    91    67    15    41   91   67   41   15

Upvotes: 1

Related Questions