Reputation: 1237
I am trying to reorder a data.frame
that contains around 250,000 rows and 7 columns. The rows I want at the top of the data.frame
are those where column 2 contains the lowest value and column 7 the highest but would go in this sequence of columns that contain the lowest to highest values: 2,5,1,4,6,3,7 (so column 5 would have the second lowest value etc.).
Once any rows that match this sequence are identified it would move on to find rows where the columns values go from lowest to highest in the sequence 2,5,1,4,6 and then 2,5,1,4 and so on until only rows where column 2 is the lowest and the other column values are randomly assorted. Any row that does not have column 2 as the lowest value would be ignored and left unsorted below the sorted rows. I am struggling to come up with any workable solution to my problem - the best I can do in terms of providing similar data to that I am working with is this:
df<-data.frame(matrix(rnorm(70000), nrow=10000))
df<-abs(df)
If anyone has any ideas, I am all ears. Thanks!
Upvotes: 2
Views: 982
Reputation: 55420
You can use the fact that order()
returns the index to the ordering,
which is exactly what you are trying to match
For example if we apply `order` twice to each row of
[1,] 23 17 118 57 20 66 137
[2,] 56 42 52 66 47 8 29
[3,] 35 5 76 35 29 217 89
We would get
[1,] 2 5 1 4 6 3 7
[2,] 6 7 2 5 3 1 4
[3,] 2 5 1 4 3 7 6
Then you simply need to check which rows match what you are looking for.
comparisons
, which indicates whether each element of a row expectedOrder
. # assuming mydf is your data frame or matrix
# the expected order of the columns
expectedOrder <- c(2,5,1,4,6,3,7)
# apply the order function twice.
ordering <- apply(mydf, 1, function(r) order(r) )
# Recall that the output of apply is transposed relative to the input.
# We make use of this along with the recycling of vectors for the comparison
comparisons <- ordering == expectedOrder
# find all rows with at least matches to 2,5,1,4
topRows <- which(colSums(comparisons[1:4, ])==4)
# reorder the indecies based on the total number of matches in comparisons
# ie: first all 7-matches, then 5-matches, then 4-matches
topRows <- topRows[order(colSums(comparisons[,topRows]), decreasing=TRUE)]
# reorder the dataframe (or matrix)
mydf.ordered <-
rbind(mydf[topRows, ],
mydf[-topRows,])
head(mydf.ordered)
# X1 X2 X3 X4 X5 X6 X7
# 23 17 118 57 20 66 137
# 39 21 102 50 24 53 163
# 80 6 159 116 44 139 248
# 131 5 185 132 128 147 202
# 35 18 75 40 33 67 151
# 61 14 157 82 57 105 355
Upvotes: 0
Reputation: 89097
Like this:
dat <- as.matrix(df)
rnk <- t(apply(dat, 1, rank))
desiredRank <- order(c(2,5,1,4,6,3,7))
rnk.match <- rnk == matrix(desiredRank, nrow(rnk), ncol(rnk), byrow = TRUE)
match.score <- apply(rnk.match, 1, match, x = FALSE) - 1
match.score[is.na(match.score)] <- ncol(dat)
out <- dat[order(match.score, decreasing = TRUE), ]
head(out)
# X1 X2 X3 X4 X5 X6 X7
#[1,] 0.7740246 0.19692680 1.5665696 0.9623104 0.2882492 1.367786 1.8644204
#[2,] 0.5895921 0.00498982 1.7143083 1.2698382 0.1776051 2.494149 1.4216615
#[3,] 0.1981111 0.11379934 1.0379619 0.2130251 0.1660568 1.227547 0.9248101
#[4,] 0.7507257 0.23353923 1.6502192 1.2232615 0.7497352 2.032547 1.4409475
#[5,] 0.5418513 0.06987903 1.8882399 0.6923557 0.3681018 2.172043 1.2215323
#[6,] 0.1731943 0.01088604 0.6878847 0.2450998 0.0125614 1.197478 0.3087192
In this example, the first row matches the whole rank sequence; the next rows match the first five ranks of the sequence:
head(match.score[order(match.score, decreasing = TRUE)])
# [1] 7 5 5 5 5 5
Upvotes: 0
Reputation: 263479
One possible approach would be to weight rankings of the values in the columns. It would be something like rank regression. 7 columns of 250K rows is not that big. For the ones you want the low values to have higher weight you could either subtract the rank from NROW(dfrm). If you want to scale the wieighting across that column ordering scheme then jsut multiply by a weighting vector: say c(1, .6, .3, 0, .3, .6, 1)
dmat <- matrix(sample(20, 20*7, repl=TRUE), 20, 7)
dfrm <- as.data.frame(dmat)
dfrm$wt <- sapply( dfrm[ c(2,5,1,4,6,3,7)] , rank); dfrm
dfrm$wt[,1:3] <- rep(NROW(dfrm),3) - dfrm$wt[ , 1:3]
dfrm$wt <- dfrm$wt*rep(c(1, .6, .3, 0, .3, .6, 1), each=NROW(dfrm))
dfrm[ order( apply( dfrm$wt, 1, FUN=sum), decreasing=TRUE ) , ]
This does not force the lowest value for V2 to be first, since you implied a multiple criterion. You still have the ability to re-weight if this is not exactly what you imagined.
Upvotes: 0
Reputation: 115505
Given that you have a largish dataset of uniform type (numeric), I would suggest using a matrix not a data.frame
tt <- abs(matrix(rnorm(70000), nrow=10000))
You have a desired order you wish to match against
desiredOrder <- c(2,5,1,4,6,3,7)
You need to find what order each of your rows is in . I think it is easiest here to ensure that you are given a list back with an element for each row. I'd suggest something like this .
orders <- lapply(apply(tt, 1, function(x) list(order(x))), unlist)
You will then need to go through (from desiredOrder[seq_len(7)]
to desiredOrder[seq_len(1)]
to test when the required subset of the order for a particular row is equal to the required subset of desired order. (I thinking some combination of sapply
with which
and all
)
Once you have identified all the rows that match your required result, you can use setdiff
to find the unmatched ones, and then reorder tt
using this new order
vector.
Upvotes: 2