Chris
Chris

Reputation: 1237

Reordering rows in a dataframe by multiple column permutations

I am trying to reorder a data.frame that contains around 250,000 rows and 7 columns. The rows I want at the top of the data.frame are those where column 2 contains the lowest value and column 7 the highest but would go in this sequence of columns that contain the lowest to highest values: 2,5,1,4,6,3,7 (so column 5 would have the second lowest value etc.).

Once any rows that match this sequence are identified it would move on to find rows where the columns values go from lowest to highest in the sequence 2,5,1,4,6 and then 2,5,1,4 and so on until only rows where column 2 is the lowest and the other column values are randomly assorted. Any row that does not have column 2 as the lowest value would be ignored and left unsorted below the sorted rows. I am struggling to come up with any workable solution to my problem - the best I can do in terms of providing similar data to that I am working with is this:

df<-data.frame(matrix(rnorm(70000), nrow=10000))
df<-abs(df)

If anyone has any ideas, I am all ears. Thanks!

Upvotes: 2

Views: 982

Answers (4)

Ricardo Saporta
Ricardo Saporta

Reputation: 55420

You can use the fact that order() returns the index to the ordering,
which is exactly what you are trying to match

For example if we apply `order` twice to each row of 
  [1,]   23   17  118   57   20   66  137
  [2,]   56   42   52   66   47    8   29
  [3,]   35    5   76   35   29  217   89

We would get
  [1,]    2    5    1    4    6    3    7
  [2,]    6    7    2    5    3    1    4
  [3,]    2    5    1    4    3    7    6

Then you simply need to check which rows match what you are looking for.



There are several ways to implement this, below is an example, where we create
a logical matrix, comparisons, which indicates whether each element of a row
is in the "correct" position, as indicated by expectedOrder.

We then order the original rows by how many elements
are in the "correct column". (using this phrase loosely, of course)

# assuming mydf is your data frame or matrix

# the expected order of the columns
expectedOrder <- c(2,5,1,4,6,3,7)

# apply the order function twice. 
ordering <- apply(mydf, 1, function(r) order(r) )  

# Recall that the output of apply is transposed relative to the input. 
# We make use of this along with the recycling of vectors for the comparison
comparisons <-  ordering == expectedOrder

# find all rows with at least matches to 2,5,1,4
topRows  <- which(colSums(comparisons[1:4, ])==4)

# reorder the indecies based on the total number of matches in comparisons
#  ie:  first all 7-matches, then 5-matches, then 4-matches
topRows <- topRows[order(colSums(comparisons[,topRows]), decreasing=TRUE)]

# reorder the dataframe (or matrix) 
mydf.ordered <- 
  rbind(mydf[topRows, ],  
        mydf[-topRows,])

head(mydf.ordered)
 #    X1 X2  X3  X4  X5  X6  X7
 #    23 17 118  57  20  66 137
 #    39 21 102  50  24  53 163
 #    80  6 159 116  44 139 248
 #   131  5 185 132 128 147 202
 #    35 18  75  40  33  67 151
 #    61 14 157  82  57 105 355

Upvotes: 0

flodel
flodel

Reputation: 89097

Like this:

dat <- as.matrix(df)
rnk <- t(apply(dat, 1, rank))
desiredRank <- order(c(2,5,1,4,6,3,7))
rnk.match <- rnk == matrix(desiredRank, nrow(rnk), ncol(rnk), byrow = TRUE)
match.score <- apply(rnk.match, 1, match, x = FALSE) - 1
match.score[is.na(match.score)] <- ncol(dat)
out <- dat[order(match.score, decreasing = TRUE), ]

head(out)
#            X1         X2        X3        X4        X5       X6        X7
#[1,] 0.7740246 0.19692680 1.5665696 0.9623104 0.2882492 1.367786 1.8644204
#[2,] 0.5895921 0.00498982 1.7143083 1.2698382 0.1776051 2.494149 1.4216615
#[3,] 0.1981111 0.11379934 1.0379619 0.2130251 0.1660568 1.227547 0.9248101
#[4,] 0.7507257 0.23353923 1.6502192 1.2232615 0.7497352 2.032547 1.4409475
#[5,] 0.5418513 0.06987903 1.8882399 0.6923557 0.3681018 2.172043 1.2215323
#[6,] 0.1731943 0.01088604 0.6878847 0.2450998 0.0125614 1.197478 0.3087192

In this example, the first row matches the whole rank sequence; the next rows match the first five ranks of the sequence:

head(match.score[order(match.score, decreasing = TRUE)])
# [1] 7 5 5 5 5 5

Upvotes: 0

IRTFM
IRTFM

Reputation: 263479

One possible approach would be to weight rankings of the values in the columns. It would be something like rank regression. 7 columns of 250K rows is not that big. For the ones you want the low values to have higher weight you could either subtract the rank from NROW(dfrm). If you want to scale the wieighting across that column ordering scheme then jsut multiply by a weighting vector: say c(1, .6, .3, 0, .3, .6, 1)

dmat <- matrix(sample(20, 20*7, repl=TRUE), 20, 7)
dfrm <- as.data.frame(dmat)

dfrm$wt <- sapply( dfrm[ c(2,5,1,4,6,3,7)] , rank); dfrm
 dfrm$wt[,1:3]  <- rep(NROW(dfrm),3) - dfrm$wt[ , 1:3] 
 dfrm$wt <- dfrm$wt*rep(c(1, .6, .3,  0, .3, .6, 1), each=NROW(dfrm))
 dfrm[ order( apply( dfrm$wt, 1, FUN=sum), decreasing=TRUE ) , ]

This does not force the lowest value for V2 to be first, since you implied a multiple criterion. You still have the ability to re-weight if this is not exactly what you imagined.

Upvotes: 0

mnel
mnel

Reputation: 115505

Given that you have a largish dataset of uniform type (numeric), I would suggest using a matrix not a data.frame

 tt <- abs(matrix(rnorm(70000), nrow=10000))

You have a desired order you wish to match against

 desiredOrder <- c(2,5,1,4,6,3,7)

You need to find what order each of your rows is in . I think it is easiest here to ensure that you are given a list back with an element for each row. I'd suggest something like this .

 orders <- lapply(apply(tt, 1, function(x) list(order(x))), unlist)

You will then need to go through (from desiredOrder[seq_len(7)] to desiredOrder[seq_len(1)] to test when the required subset of the order for a particular row is equal to the required subset of desired order. (I thinking some combination of sapply with which and all)

Once you have identified all the rows that match your required result, you can use setdiff to find the unmatched ones, and then reorder tt using this new order vector.

Upvotes: 2

Related Questions