Reputation: 3264

Row-wise sum of first, second and third highest values

Considering the following sample data:

"A" "B" "C" "D" "E" "F" "G" "H" "I" "L" "K"
"1" 2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
"2" 23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
"3" 9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
"4" 2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
"5" 5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
"6" 8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76

I am struggling to find a solution that finds and sums the three highest values in each row.

Any ideas?

Upvotes: 1

Answers (4)

moodymudskipper

Reputation: 47320

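Using the tidyverse:

library(tidyverse)

df %>%
  rowid_to_column() %>%                 # keep track of the original row
  gather("letter", "value", -1) %>%     # reshape to long format
  group_by(rowid) %>%
  arrange(desc(value)) %>%              # largest values first
  slice(1:3) %>%                        # first three rows of each group
  summarize(value = sum(value))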

# # A tibble: 6 x 2
#   rowid value
#   <int> <dbl>
# 1     1  66.5
# 2     2  56.4
# 3     3  76.8
# 4     4  63.4
# 5     5  90.6
# 6     6  76.5

Alternate solution, inspired by @andrew_reece's solution:

tdf <- setNames(as.data.frame(t(df)),seq(nrow(df)))
tdf %>%
  summarize_all(~sum(tail(sort(.),3))) %>%
  gather(rowid,value)

#   rowid value
# 1     1 66.46
# 2     2 56.44
# 3     3 76.83
# 4     4 63.39
# 5     5 90.62
# 6     6 76.47

Note: the more idiomatic tidyverse way to do setNames(as.data.frame(t(df)),seq(nrow(df))) would be as follows (the result also keeps a letter column):

df %>%
  rowid_to_column %>%
  gather("letter","value",-1) %>%
  spread("rowid","value")

Upvotes: 4

Martin Morgan

Reputation: 46866

If you had a vector x, a function to calculate the sum of the three largest values might be

fun = function(x)
    sum(tail(sort(x), 3))

You'd like to apply this to each row of your object m

apply(m, 1, fun)
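The answer does not show how m is built, so here is a minimal self-contained sketch. It assumes m is simply the question's data read with read.table() and converted to a numeric matrix; that construction is my assumption, not something the answer states.

```r
# Read the question's sample data; the header row has one field fewer than
# the data rows, so the first column becomes the row names.
df <- read.table(text = '"A" "B" "C" "D" "E" "F" "G" "H" "I" "L" "K"
"1" 2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
"2" 23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
"3" 9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
"4" 2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
"5" 5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
"6" 8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76', header = TRUE)

# Assumed construction of m: a plain numeric matrix of the same values
m <- as.matrix(df)

# Sum of the three largest values of a vector (sort() is ascending)
fun <- function(x) sum(tail(sort(x), 3))

# Apply to each row (MARGIN = 1)
res <- apply(m, 1, fun)
print(res)
```

The values agree with the sums reported in the other answers (66.46, 56.44, 76.83, 63.39, 90.62, 76.47).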

A somewhat faster (e.g., 40%) implementation is

colSums(apply(m, 1, sort, decreasing = TRUE)[1:3, ])

Or using partial sorting (the indices 9:11 assume that m has 11 columns)

colSums(apply(m, 1, sort.int, partial = 9:11)[9:11, ])
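The hard-coded 9:11 only fits an 11-column matrix. As a sketch, the indices can be derived from ncol(m) instead; the names p and top_idx are mine, the example uses only the first three rows of the question's data, and it assumes m has at least three columns.

```r
# First three rows of the question's data, as a numeric matrix
m <- matrix(c(
   2.59, 23.08,  4.61, 10.32, 5.61, 0.05, 0.34, 24.04, 19.34,  5.76,  4.27,
  23.31,  1.84, 15.34,  1.23, 0,    0,    6.13, 12.88,  6.13, 17.18, 15.95,
   9.47,  0,    13.68,  9.47, 0,    0,    0,     2.11,  5.26,  6.32, 53.68
), nrow = 3, byrow = TRUE)

p <- ncol(m)
top_idx <- (p - 2):p   # positions of the three largest values after sorting

# Partial sort of each row: only the positions in top_idx are guaranteed
# to hold their final sorted values. apply() returns one column per row of m.
sorted <- apply(m, 1, sort.int, partial = top_idx)
res_partial <- colSums(sorted[top_idx, , drop = FALSE])

# Reference result using a full sort
res_full <- apply(m, 1, function(x) sum(tail(sort(x), 3)))

print(res_partial)
print(isTRUE(all.equal(res_partial, res_full)))
```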

For performance, if there are many rows in m, avoid the iteration over rows implied by use of apply(). An implementation might be

library(matrixStats)
rowSums(m * (rowRanks(m) > ncol(m) - 3))

but this fails when there are ties in a row of m; rowRanks() does not support ties.method = "first". Instead, implement our own rowRanks()

.rowRanks <- function(m) {
    m[] = sort.list(sort.list(m))
    rowRanks(m)
}
rowSums(m * (.rowRanks(m) > ncol(m) - 3))
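To see why ties matter for the rank-mask approach, here is a base R sketch on row 3 of the question's data. It uses rank() rather than matrixStats, and picking "max" as the tie rule is my assumption, chosen to reproduce the failing case; "first" breaks ties by position, which is what the custom .rowRanks() achieves.

```r
# Row 3 of the question's data; note that 9.47 appears twice
x <- c(9.47, 0, 13.68, 9.47, 0, 0, 0, 2.11, 5.26, 6.32, 53.68)
n <- length(x)

# Tied values share a rank: both 9.47 entries get rank 9,
# so the mask "rank > n - 3" selects four values instead of three
r_max <- rank(x, ties.method = "max")
sum_max <- sum(x[r_max > n - 3])

# Ties broken by position: the 9.47 entries get ranks 8 and 9,
# so exactly three values pass the mask
r_first <- rank(x, ties.method = "first")
sum_first <- sum(x[r_first > n - 3])

print(c(max = sum_max, first = sum_first))
```

With the tie-sharing rule the row sums four values (86.30); with position-based tie breaking it sums the intended three (76.83).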

For a tidyverse solution, it seems one wants to start with tidy data

tbl <- df %>%
  rowid_to_column() %>%
  gather("letter","value",-1) %>%
  group_by(rowid)

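Solutions using top_n() generate incorrect results when a row has ties, because top_n() keeps all tied values (row 3 has two values of 9.47, so four values get summed), so the best bet would seem to be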

summarize(tbl, fun(value))

(fun could be inlined here, but that doesn't really seem to be a good idea, because it makes the function harder to modify and test separately.)

Comparing the methods

f0 = function(m) apply(m, 1, fun)
f0a = function(m) colSums(apply(m, 1, sort, decreasing=TRUE)[1:3, ])
f0b = function(m) colSums(apply(m, 1, sort.int, partial = 9:11)[9:11, ])
f1 = function(m) rowSums(m * (.rowRanks(m) > ncol(m) - 3))
f2 = function(tbl) summarize(tbl, fun(value))

with

> library(microbenchmark)
> identical(f0(m), f0a(m))
[1] TRUE
> identical(f0(m), f0b(m))
[1] TRUE
> identical(f0(m), f1(m))
[1] TRUE
> identical(f0(m), f2(tbl)$`fun(value)`)
[1] TRUE
> microbenchmark(f0(m), f0a(m), f0b(m), f1(m), f2(tbl), times=10)
Unit: microseconds
    expr      min       lq      mean    median       uq      max neval
   f0(m)  837.505  860.890  894.9245  905.9880  921.386  948.242    10
  f0a(m)  594.895  637.258  650.7217  653.1800  673.599  713.167    10
  f0b(m)  274.925  277.734  305.6551  296.2975  330.482  362.765    10
   f1(m)  166.416  169.290  192.8086  189.5945  215.491  219.478    10
 f2(tbl) 2265.451 2277.599 2425.6083 2327.7015 2359.896 3349.995    10

> m = m[sample(nrow(m), 1000, TRUE),]
> microbenchmark(f0(m), f0a(m), f0b(m), f1(m), times=10)
Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
  f0(m) 137.705781 139.793459 141.658415 141.821540 143.653272 144.428092    10
 f0a(m)  85.946679  86.663967  88.500392  87.513880  89.634696  94.458554    10
 f0b(m)  29.762981  30.890124  32.470553  32.649594  33.116767  36.686603    10
  f1(m)   2.034407   2.120689   2.137723   2.144328   2.176306   2.184712    10

Upvotes: 6

andrew_reece

Reputation: 21264

You can take the transpose, and map over each column with top_n(3) and sum:

library(tidyverse)

tdf <- as.data.frame(t(df)) 
var_names <- names(tdf)

var_names %>%
  map_dfc(~tdf %>% select(.x) %>% top_n(3) %>% sum()) %>% 
  t()

    [,1]
V1 66.46
V2 56.44
V3 86.30
V4 63.39
V5 90.62
V6 76.47

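Feels a bit hacky, but gets the job done. One caveat: top_n() keeps ties, so V3, which contains 9.47 twice, sums four values (86.30) rather than three (76.83).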

Update: a better way, from Moody_Mudskipper:

var_names %>% 
  map_dfc(~tdf %>% select(.x) %>% top_n(3) %>% summarize_all(sum)) %>% 
  gather

Data:

df <- read.table(text='"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "L"
2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76', header=TRUE)

^ Note: this is slightly different from the OP's data; the row numbers are removed.

Upvotes: 3

Seymour

Reputation: 3264

df <- read.table(text = '"A" "B" "C" "D" "E" "F" "G" "H" "I" "L" "K"
"1" 2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
"2" 23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
"3" 9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
"4" 2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
"5" 5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
"6" 8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76')

Following @Soheil's suggestion, a base R solution:

rowSums(t(apply(df, 1, FUN = function(x) sort(x, decreasing = TRUE)))[ , c(1,2,3)])

Upvotes: 2
