Reputation: 3264
Considering the following sample data:
"A" "B" "C" "D" "E" "F" "G" "H" "I" "L" "K"
"1" 2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
"2" 23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
"3" 9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
"4" 2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
"5" 5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
"6" 8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76
I am struggling to find a solution that finds and sums the three highest values in each row.
Any ideas?
Upvotes: 1
Views: 131
Reputation: 47320
Using tidyverse:
df %>%
rowid_to_column() %>%
gather("letter","value",-1) %>%
group_by(rowid) %>%
arrange(desc(value)) %>%
slice(1:3) %>%
summarize(value= sum(value))
# # A tibble: 6 x 2
# rowid value
# <int> <dbl>
# 1 1 66.5
# 2 2 56.4
# 3 3 76.8
# 4 4 63.4
# 5 5 90.6
# 6 6 76.5
Alternate solution, inspired by @andrew_reece's solution:
tdf <- setNames(as.data.frame(t(df)),seq(nrow(df)))
tdf %>%
summarize_all(~sum(tail(sort(.),3))) %>%
gather(rowid,value)
# rowid value
# 1 1 66.46
# 2 2 56.44
# 3 3 76.83
# 4 4 63.39
# 5 5 90.62
# 6 6 76.47
Note: the real tidyverse way to do setNames(as.data.frame(t(df)),seq(nrow(df))) would be as follows:
df %>%
rowid_to_column %>%
gather("letter","value",-1) %>%
spread("rowid","value")
Upvotes: 4
Reputation: 46866
If you had a vector x, a function to calculate the sum of the three largest values might be
fun = function(x)
sum(tail(sort(x), 3))
You'd like to apply this to each row of your object m (a numeric matrix, e.g. as.matrix(df))
apply(m, 1, fun)
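The answer does not show how m is built, so here is a minimal sketch that assumes m is simply the question's sample data as a numeric matrix. The matrix() call and the name top3 are illustrative, not part of the original answer.

```r
# Assumed setup: the question's sample data as a 6 x 11 numeric matrix.
m <- matrix(
  c( 2.59, 23.08,  4.61, 10.32, 5.61, 0.05, 0.34, 24.04, 19.34,  5.76,  4.27,
    23.31,  1.84, 15.34,  1.23, 0,    0,    6.13, 12.88,  6.13, 17.18, 15.95,
     9.47,  0,    13.68,  9.47, 0,    0,    0,     2.11,  5.26,  6.32, 53.68,
     2.16, 29.99,  4.58,  7.92, 0.92, 0,    0,     8.97, 16.57, 12.05, 16.83,
     5.4,  76.35,  2.06,  8.87, 0,    0,    0.26,  0.39,  0.9,   2.7,   3.08,
     8.24, 30,     7.65,  0,    1.18, 0,    0,     4.71, 22.94, 23.53,  1.76),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(1:6),
                  c("A", "B", "C", "D", "E", "F", "G", "H", "I", "L", "K"))
)

# Sum of the three largest values of a vector: sort ascending, keep the tail.
fun <- function(x) sum(tail(sort(x), 3))

# MARGIN = 1 applies fun to each row; the result is named by row.
top3 <- apply(m, 1, fun)
print(top3)  # approximately 66.46 56.44 76.83 63.39 90.62 76.47
```

Because sort() keeps duplicates as separate elements, a row with tied values (such as row 3, where 9.47 appears twice) still contributes exactly three values to the sum.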
A somewhat faster (about 40%) implementation is
colSums(apply(m, 1, sort, decreasing = TRUE)[1:3, ])
Or using partial sorting
colSums(apply(m, 1, sort.int, partial = 9:11)[9:11, ])
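The partial-sorting line hard-codes 9:11, which only fits rows with exactly 11 values. A hedged sketch of the same idea for any row length follows; top_k_sum and the random test matrix are hypothetical names and data introduced here for illustration.

```r
# Hypothetical helper: sum of the k largest values of x via partial sorting.
# sort.int(x, partial = idx) only guarantees that the elements at positions
# idx are the ones a full sort would put there, which is all we need.
# Assumes length(x) >= k and no NA values.
top_k_sum <- function(x, k = 3) {
  n <- length(x)
  idx <- (n - k + 1):n  # the last k positions of the sorted vector
  sum(sort.int(x, partial = idx)[idx])
}

set.seed(1)
m <- matrix(round(runif(60, 0, 100), 2), nrow = 6)  # illustrative 6 x 10 data

partial_result <- apply(m, 1, top_k_sum)
full_result <- apply(m, 1, function(x) sum(tail(sort(x), 3)))

# Both approaches should agree row by row.
stopifnot(isTRUE(all.equal(partial_result, full_result)))
```

With 11 columns and k = 3, idx evaluates to 9:11, matching the indices used in the original line.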
For performance, if there are many rows in m, avoid the iteration over rows implied by use of apply(). An implementation might be
library(matrixStats)
rowSums(m * (rowRanks(m) > ncol(m) - 3))
but this fails when there are ties in a row of m; rowRanks() does not support ties.method = "first". Instead, implement our own rowRanks()
.rowRanks <- function(m) {
m[] = sort.list(sort.list(m))
rowRanks(m)
}
rowSums(m * (.rowRanks(m) > ncol(m) - 3))
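To see why ties matter, here is a small base R sketch on row 3 of the sample data, where 9.47 appears twice. It uses rank(x, ties.method = "max") as a stand-in for a ranking that gives tied values the same rank; that choice is an assumption made so the example runs without matrixStats, not a statement about what rowRanks() does internally.

```r
# Row 3 of the question's data; 9.47 occurs twice.
x <- c(9.47, 0, 13.68, 9.47, 0, 0, 0, 2.11, 5.26, 6.32, 53.68)
n <- length(x)

# Tied ranks: both 9.47 values share one rank, so four values pass the cutoff.
tied_ranks <- rank(x, ties.method = "max")
with_ties <- sum(x * (tied_ranks > n - 3))

# The double sort.list() trick from .rowRanks(): ties are broken by position,
# so every element gets a unique rank and exactly three values pass the cutoff.
unique_ranks <- sort.list(sort.list(x))
no_ties <- sum(x * (unique_ranks > n - 3))

print(c(with_ties = with_ties, no_ties = no_ties))  # approximately 86.30 and 76.83
```

The 86.30 figure is the three largest values plus the extra tied 9.47, which is the kind of overcount the custom .rowRanks() is meant to avoid.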
For a tidyverse solution, it seems one wants to start with tidy data
tbl <- df %>%
rowid_to_column() %>%
gather("letter","value",-1) %>%
group_by(rowid)
Solutions using top_n() generate incorrect results when a row has ties among its largest values (top_n() keeps all tied values), so the best bet would seem to be
summarize(tbl, fun(value))
(fun could be expanded here, but that doesn't really seem to be a good idea because it makes it harder to separately modify and test).
Comparing the methods
f0 = function(m) apply(m, 1, fun)
f0a = function(m) colSums(apply(m, 1, sort, decreasing=TRUE)[1:3, ])
f0b = function(m) colSums(apply(m, 1, sort.int, partial = 9:11)[9:11, ])
f1 = function(m) rowSums(m * (.rowRanks(m) > ncol(m) - 3))
f2 = function(tbl) summarize(tbl, fun(value))
with
> library(microbenchmark)
> identical(f0(m), f0a(m))
[1] TRUE
> identical(f0(m), f0b(m))
[1] TRUE
> identical(f0(m), f1(m))
[1] TRUE
> identical(f0(m), f2(tbl)$`fun(value)`)
[1] TRUE
> microbenchmark(f0(m), f0a(m), f0b(m), f1(m), f2(tbl), times=10)
Unit: microseconds
expr min lq mean median uq max neval
f0(m) 837.505 860.890 894.9245 905.9880 921.386 948.242 10
f0a(m) 594.895 637.258 650.7217 653.1800 673.599 713.167 10
f0b(m) 274.925 277.734 305.6551 296.2975 330.482 362.765 10
f1(m) 166.416 169.290 192.8086 189.5945 215.491 219.478 10
f2(tbl) 2265.451 2277.599 2425.6083 2327.7015 2359.896 3349.995 10
> m = m[sample(nrow(m), 1000, TRUE),]
> microbenchmark(f0(m), f0a(m), f0b(m), f1(m), times=10)
Unit: milliseconds
expr min lq mean median uq max neval
f0(m) 137.705781 139.793459 141.658415 141.821540 143.653272 144.428092 10
f0a(m) 85.946679 86.663967 88.500392 87.513880 89.634696 94.458554 10
f0b(m) 29.762981 30.890124 32.470553 32.649594 33.116767 36.686603 10
f1(m) 2.034407 2.120689 2.137723 2.144328 2.176306 2.184712 10
Upvotes: 6
Reputation: 21264
You can take the transpose, and map over each column with top_n(3) and sum:
library(tidyverse)
tdf <- as.data.frame(t(df))
var_names <- names(tdf)
var_names %>%
map_dfc(~tdf %>% select(.x) %>% top_n(3) %>% sum()) %>%
t()
[,1]
V1 66.46
V2 56.44
V3 86.30
V4 63.39
V5 90.62
V6 76.47
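Feels a bit hacky, but mostly gets the job done. Note that top_n() keeps ties, so V3, where 9.47 appears twice, sums four values (86.30) instead of three (76.83).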
Update: a better way, from Moody_Mudskipper:
var_names %>%
map_dfc(~tdf %>% select(.x) %>% top_n(3) %>% summarize_all(sum)) %>%
gather
Data:
df <- read.table(text='"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "L"
2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76', header=TRUE)
^ Note: Slightly different data than OP, row numbers are removed.
Upvotes: 3
Reputation: 3264
df <- read.table(text = '"A" "B" "C" "D" "E" "F" "G" "H" "I" "L" "K"
"1" 2.59 23.08 4.61 10.32 5.61 0.05 0.34 24.04 19.34 5.76 4.27
"2" 23.31 1.84 15.34 1.23 0 0 6.13 12.88 6.13 17.18 15.95
"3" 9.47 0 13.68 9.47 0 0 0 2.11 5.26 6.32 53.68
"4" 2.16 29.99 4.58 7.92 0.92 0 0 8.97 16.57 12.05 16.83
"5" 5.4 76.35 2.06 8.87 0 0 0.26 0.39 0.9 2.7 3.08
"6" 8.24 30 7.65 0 1.18 0 0 4.71 22.94 23.53 1.76')
Following @Soheil's suggestion, a base R solution:
rowSums(t(apply(df, 1, FUN = function(x) sort(x, decreasing = TRUE)))[ , c(1,2,3)])
Upvotes: 2