Henry David Thorough

Reputation: 900

Efficiently Extract Minimum Value AND Index for Each Column and Row in a Dataframe, then Rank by Value

I have a JxK dataframe M and I want to calculate the following.

  1. For each row j, the column index k that minimizes M[j,k]
  2. For each column k, the row index j that minimizes M[j,k]

Then, let the minimum values found in the first step form vector A_j and those found in the second form vector A_k. I then need two more vectors. Let vector C be the vector sort(c(A_j, A_k)).

  3. A vector of length equal to A_j where element i is the index of element A_j[i] in the combined and sorted vector C.
  4. A vector of length equal to A_k where element i is the index of element A_k[i] in the combined and sorted vector C.

In both of these index vectors, tied values should receive the first index at which the value appears in vector C. That is, if A_j[i] and A_j[i+1] are equal, then elements i and i + 1 of the vector that satisfies condition #3 should both equal the position of A_j[i] in the sorted vector C.
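To make the tie rule concrete, here is a tiny worked sketch. The values in A_j and A_k are made up purely for illustration, and out3 and out4 are hypothetical names for the vectors described in conditions #3 and #4.

```r
# Hypothetical minima, chosen so that ties occur
A_j <- c(2, 1, 2)   # row minima
A_k <- c(1, 3)      # column minima

# Combined and sorted vector: 1 1 2 2 3
C <- sort(c(A_j, A_k))

# First position of each value in C, so tied values share an index
out3 <- vapply(A_j, function(x) which(C == x)[1], integer(1))  # 3 1 3
out4 <- vapply(A_k, function(x) which(C == x)[1], integer(1))  # 1 5
```

Both 2s in A_j map to position 3, the first place a 2 appears in C, and the 1 shared by A_j and A_k maps to position 1 in both outputs.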

As always, this is not hard to do inefficiently. However, in practice, the dataframe is very big, so inefficient solutions fail.

As a proof of concept, one solution would be as follows.

# Create the data frame
set.seed(1)
df <- data.frame(matrix(rnorm(50, 8, 2), 10)) # A 10x5 data frame

# Minimum value of each row (A_j) and of each column (A_k)
A.j <- apply(df, 1, min)
A.k <- apply(df, 2, min)

# Calculate 1 and 2: the indices at which those minima occur
A.j.indices <- apply(df, 1, function(x) which(x == min(x)))
A.k.indices <- apply(df, 2, function(x) which(x == min(x)))

# Calculate 3 and 4: position of each minimum in the combined, sorted vector
C <- sort(unname(c(A.j, A.k)))

vec3out <- c()
vec4out <- c()

for(j in seq_len(nrow(df))){
   pos <- which(C == A.j[j])[1] # first match, so ties share an index
   vec3out <- c(vec3out, pos)
}

for(k in seq_len(ncol(df))){
   pos <- which(C == A.k[k])[1] # first match, so ties share an index
   vec4out <- c(vec4out, pos)
}

Upvotes: 2

Views: 1473

Answers (1)

stanekam

Reputation: 4030

For starters, you should use a matrix. Data frames are less efficient than matrices for this kind of purely numeric work (see "Should I use a data.frame or a matrix?"). Then, we should use the apply functions.

Let M be your data.frame coerced to a matrix.

M <- as.matrix(M)

minByRow <- apply(M, MARGIN=1, FUN=which.min)
minByCol <- apply(M, MARGIN=2, FUN=which.min)

combinedSorted <- sort(c(minByRow, minByCol))

byRowOutput <- match(minByRow, combinedSorted)
byColOutput <- match(minByCol, combinedSorted)

Here are the results for 1 million observations of 100 variables:

M <- matrix(data=rnorm(100000000), nrow=1000000, ncol=100)


system.time({
  minByRow <- apply(M, MARGIN=1, FUN=which.min)
  minByCol <- apply(M, MARGIN=2, FUN=which.min)

  combinedSorted <- sort(c(minByRow, minByCol))

  byRowOutput <- match(minByRow, combinedSorted)
  byColOutput <- match(minByCol, combinedSorted)
})

   user  system elapsed 
   7.37    0.46    7.93 
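
Note that the code above sorts and matches the positions returned by which.min. If the ranks should instead be based on the minimum values themselves, as in the question's proof of concept, a minimal sketch could look like the following. The names rowIdx, colIdx, rowMin, colMin, byRowRank and byColRank are illustrative only, and the small matrix simply mirrors the question's example data; I have not timed this variant on the large matrix.

```r
# Small example matrix (same construction as the question's proof of concept)
set.seed(1)
M <- matrix(rnorm(50, 8, 2), nrow = 10)

# Index of the minimum in each row / column (which.min picks the first on ties)
rowIdx <- apply(M, MARGIN = 1, FUN = which.min)
colIdx <- apply(M, MARGIN = 2, FUN = which.min)

# Minimum values, extracted by matrix indexing rather than a second apply() pass
rowMin <- M[cbind(seq_len(nrow(M)), rowIdx)]
colMin <- M[cbind(colIdx, seq_len(ncol(M)))]

# Sort the combined minima; match() returns the first position of each value,
# so tied values share the earliest index
combinedSorted <- sort(c(rowMin, colMin))
byRowRank <- match(rowMin, combinedSorted)
byColRank <- match(colMin, combinedSorted)
```

Here rowIdx and colIdx correspond to items 1 and 2 of the question, and byRowRank and byColRank to conditions #3 and #4. Because match() always returns the first matching position, no explicit loop is needed to handle ties.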

Upvotes: 2
