Reputation: 356

Sort 2 columns simultaneously based on the group name

Dataset

> read.delim("df.tsv")
   col1 col2 group
1     3    2    aa
2     1    1    aa
3     4    1    aa
4     4    3    aa
5     5    3    ab
6     3    2    ab
7     4    1    ab
8     2    4    ab
9     4    2    ba
10    1    4    ba
11    3    1    ba
12    4    3    ba
13    4    2    bb
14    2    3    bb
15    3    1    bb
16    1    2    bb

I want to sort the columns col1 and col2 within each of the 4 groups, in the following way:

If the 1st character in the group name is "a", sort col1 in a descending manner, and ascending if it's "b"
If the 2nd character in the group name is "a", sort col2 in a descending manner, and ascending if it's "b"
Importantly, I would like that both columns are sorted simultaneously, i.e. if e.g. the group is "aa", the sorting for that group should look like this:

   col1 col2 group
1     4    3    aa
2     3    2    aa
3     4    1    aa
4     1    1    aa
...

This could be accomplished by e.g. a "one row at a time" approach, first col1 and then col2, alternating for each row.

Current code and output

library(dplyr)

read.delim("df.tsv") %>%
  group_by(group) %>%
  arrange(ifelse(substr(group, 1,1) == "a", desc(col1), col1), # if first character in group name is "a", sort col1 in a descending manner, and ascending if it's "b"
          ifelse(substr(group, 2,2) == "a", desc(col2), col2), # if second character in group name is also "a", sort also col2 in a descending manner, and ascending if it's "b"
          .by_group = TRUE)

    col1  col2 group
 1     4     3 aa   
 2     4     1 aa   
 3     3     2 aa   
 4     1     1 aa   
 5     5     3 ab   
 6     4     1 ab   
 7     3     2 ab   
 8     2     4 ab   
 9     1     4 ba   
10     3     1 ba   
11     4     3 ba   
12     4     2 ba   
13     1     2 bb   
14     2     3 bb   
15     3     1 bb   
16     4     2 bb

However, this does not fulfill the 3rd criterion, the "simultaneous sorting one row at a time".

Desired output

    col1  col2 group
 1     4     3 aa   
 2     3     2 aa   
 3     4     1 aa   
 4     1     1 aa   
 5     5     3 ab   
 6     4     1 ab   
 7     3     2 ab   
 8     2     4 ab   
 9     1     4 ba   
10     4     3 ba   
11     3     1 ba   
12     4     2 ba   
13     1     2 bb   
14     3     1 bb   
15     2     3 bb   
16     4     2 bb

EDIT

There are a couple of answers that actually do the proposed task, so I think a tie-breaker could be that the algorithm is flexible with respect to the number of columns to sort, e.g. 3:

col1    col2    col3    group
3   2   4   aaa
1   1   2   aaa
4   1   4   aaa
4   3   1   aaa
5   3   3   aab
3   2   2   aab
4   1   1   aab
2   4   1   aab
4   2   3   aba
1   4   3   aba
3   1   2   aba
4   3   3   aba
3   2   4   abb
1   1   2   abb
4   1   4   abb
4   3   1   abb
4   2   1   baa
2   3   2   baa
3   1   2   baa
1   2   1   baa
5   3   3   bab
3   2   2   bab
4   1   1   bab
2   4   1   bab
4   2   3   bba
1   4   3   bba
3   1   2   bba
4   3   3   bba
4   2   1   bbb
2   3   2   bbb
3   1   2   bbb
1   2   1   bbb

The output should be

col1    col2    col3    group
4   3   1   aaa
3   2   4   aaa
4   1   4   aaa
1   1   2   aaa
5   3   3   aab
2   4   1   aab
4   1   1   aab
3   2   2   aab
4   2   3   aba
3   1   2   aba
4   3   3   aba
1   4   3   aba
4   1   4   abb
1   1   2   abb
4   3   1   abb
3   2   4   abb
1   2   1   baa
2   3   2   baa
3   1   2   baa
4   2   1   baa
2   4   1   bab
5   3   3   bab
4   1   1   bab
3   2   2   bab
1   4   3   bba
3   1   2   bba
4   2   3   bba
4   3   3   bba
1   2   1   bbb
3   1   2   bbb
4   2   1   bbb
2   3   2   bbb

Currently the 2 suggested solutions do not work when 3 or more columns are included, they sort based on only 2 columns.

EDIT 2

If e.g. group=='aba', the first row of this group should be the one that includes the highest value in col1; the 2nd row the one that includes the (remaining) lowest value in col2; the 3rd row the one that includes the (remaining) highest value in col3, and the 4th row is the remaining row. However, this should be flexible to allow for more than 4 rows per group, in that case the 4th row should be the one that includes the (remaining) highest value in col1; the 5th row should be the one that includes the (remaining) lowest value in col2; etc.

More details

Example: For the 2nd row of the 'aba' group, in the case that there is a tie between 2 rows for the lowest (remaining) value in col2, e.g.

row-a 3 1 4 aba
row-b 2 1 4 aba

(notice that there is a 1 in col2 in both rows), ideally then the chosen 2nd row would be row-a, since the col1 has to be sorted in a descending manner ('a') in this group, and 3>2, and for col3 4==4 anyway.

If instead

row-a 3 1 4 aba
row-b 2 1 5 aba

let the priority go col3>col2>col1, since the cycle goes col1>col2>col3... so the 2nd row would be row-b, since 5>4.

So to generalize, if there were 5 columns and the group were 'aabaa', and there is a tie for choosing the 3rd row between 2 rows:

row-a 3 2 1 3 3 aabaa
row-b 5 4 1 4 2 aabaa

(col3 == 1 in both), then the one to select would be row-a since for col5 3>2. If instead

row-a--> 3 2 1 3 3
row-b--> 5 4 1 4 3

(col5==3 in both), then choose row-b since for col4 4>3.

Upvotes: 3

Answers (3)

ekoam

Reputation: 8844

On second thought, I think I can just pass that option to you. You can specify any cycling method you want now.

alt_order <- function(..., type, cyc) {
  cols <- unname(list(...))
  stopifnot(
    # sanity checks; you may skip if you think they are unnecessary
    length(unique(lengths(cols))) == 1L,
    length(cols) == length(type),
    all(unlist(type) %in% c(1L, -1L))
  ) 
  cols <- mapply(`*`, cols, type, SIMPLIFY = FALSE)
  out <- integer(length(cols[[1L]]))
  this <- cols
  for (i in seq_along(out)) {
    out[[i]] <- do.call(order, this)[[1L]]
    cols <- lapply(cols, `is.na<-`, out[[i]])
    this <- cols[cyc(i)]
  }
  out
}

cyc should be a function that accepts a single integer as input and returns a vector of integers. For example, if you have 3 columns and you want to replicate the rev cycling behavior as I described in the comment below, you can do this

mycyc <- function(i) list(1:3, 3:1)[[(i - 1) %% 2L + 1L]]
df %>% group_by(group) %>% slice(alt_order(col1, col2, col3, type = ab2sign(group), cyc = mycyc))

Well, perhaps a not efficient but simple solution is to just sort the two columns continuously, swap the major column each time, and discharge the first element until no element is left to be sorted. Here is the function.

alt_order <- function(..., type) {
  cols <- unname(list(...))
  stopifnot(
    # sanity checks; you may skip if you think they are unnecessary
    length(unique(lengths(cols))) == 1L,
    length(cols) == length(type),
    all(unlist(type) %in% c(1L, -1L))
  ) 
  cols <- mapply(`*`, cols, type, SIMPLIFY = FALSE)
  out <- integer(length(cols[[1L]]))
  for (i in seq_along(out)) {
    out[[i]] <- do.call(order, cols)[[1L]]
    cols <- rev(lapply(cols, `is.na<-`, out[[i]]))
  }
  out
}

We assign values to NAs to discharge them since NAs will be sorted to the last in an ascending way. type should be either 1 or -1 and is used to streamline the order we would like to impose since the descending order of c(1,2,3) is the same as the ascending order of -1 * c(1,2,3). We also need a helper function as follows to transfer your groups into 1 and -1

ab2sign <- function(x) {
  out <- data.table::transpose(strsplit(x, "", fixed = TRUE))
  lapply(out, function(x) 2L * (x == "b") - 1L)
}

Now we can apply them

df %>% group_by(group) %>% slice(alt_order(col1, col2, type = ab2sign(group)))

Output

# A tibble: 16 x 3
# Groups:   group [4]
    col1  col2 group
   <int> <int> <chr>
 1     4     3 aa   
 2     3     2 aa   
 3     4     1 aa   
 4     1     1 aa   
 5     5     3 ab   
 6     4     1 ab   
 7     3     2 ab   
 8     2     4 ab   
 9     1     4 ba   
10     4     3 ba   
11     3     1 ba   
12     4     2 ba   
13     1     2 bb   
14     3     1 bb   
15     2     3 bb   
16     4     2 bb

Perhaps more efficient (perhaps vectorized) solutions are available.

Upvotes: 4

ThomasIsCoding

Reputation: 102770

Update

Below is an option may work in general cases, i.e., more than 2 columns:

f <- function(.) {
  col <- .[-length(.)] * (2 * (t(list2DF(strsplit(.$group, ""))) == "b") - 1)
  r <- data.frame()
  while (nrow(.)) {
    p <- do.call(order, col[(seq_along(col) + nrow(r) - 1) %% length(col) + 1])[1]
    r <- rbind(r, .[p, ])
    col <- col[-p, ]
    . <- .[-p, ]
  }
  r
}

df %>%
  group_by(group) %>%
  do(f(.)) %>%
  ungroup()

which gives

    col1  col2  col3 group
   <int> <int> <int> <chr>
 1     4     3     1 aaa
 2     3     2     4 aaa
 3     4     1     4 aaa
 4     1     1     2 aaa
 5     5     3     3 aab
 6     2     4     1 aab
 7     4     1     1 aab
 8     3     2     2 aab
 9     4     2     3 aba
10     3     1     2 aba
# ... with 22 more rows

Here is an option using dynamic programming (but maybe not that efficient)

f <- function(.) {
  col <- with(., data.frame(col1, col2) * (2 * (t(list2DF(strsplit(.$group, ""))) == "b") - 1))
  r <- data.frame()
  while (nrow(.)) {
    p <- do.call(order, ifelse(nrow(r) %% 2, rev, I)(col))[1]
    r <- rbind(r, .[p, ])
    col <- col[-p,]
    . <- .[-p, ]
  }
  r
}

df %>%
  group_by(group) %>%
  do(f(.)) %>%
  ungroup()

which gives

# A tibble: 16 x 3
    col1  col2 group
   <int> <int> <chr>
 1     4     3 aa
 2     3     2 aa
 3     4     1 aa
 4     1     1 aa
 5     5     3 ab
 6     4     1 ab
 7     3     2 ab
 8     2     4 ab
 9     1     4 ba
10     4     3 ba
11     3     1 ba
12     4     2 ba
13     1     2 bb
14     3     1 bb
15     2     3 bb
16     4     2 bb

Upvotes: 2

jumper

Reputation: 5

I can tell you the answer, but I cannot write complete r code for it, since I don't know r, I hope some one may edit my code for a complete answer.

suppose both sorts are ascending (you can generalize it to your case)

idx1=order(col1)
idx2=order(col2[idx1])

return col1[idx1[idx2]], col2[idx1[idx2]]

Upvotes: -2