Reputation: 693
Suppose we have the following dataset in R:
> td
Type Rep Value1 Value2
1 A 1 7 1
2 A 2 5 4
3 A 3 5 3
4 A 4 8 2
5 B 1 5 10
6 B 2 6 1
7 B 3 7 1
8 C 1 8 13
9 C 2 8 13
> td <- structure(list(Type = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Rep = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L), Value1 = c(7L, 5L, 5L, 8L, 5L,
6L, 7L, 8L, 8L), Value2 = c(1L, 4L, 3L, 2L, 10L, 1L, 1L, 13L,
13L)), .Names = c("Type", "Rep", "Value1", "Value2"), class = "data.frame",
row.names = c(NA, -9L))
I would like to produce the following table:
Type MinValue1 MinValue2 MeanValue1 MeanValue2
1 A 5 3 6.25 2.5
2 B 5 10 6.00 4.0
3 C 3 13 8.00 13.0
In this table, the data is summarized by "Type." Column "MinValue1" is the minimal value for a particular type and column "MinValue2" is the minimal value of "Value2", given the minimal value(s) of column "Value1". Columns Mean* are the general average of all observations.
One way to do this is to implement loops that iterate over each type and do the math. However, I am looking for a better/easy/beautiful way to perform such operation.
I have played with tools from "tidyverse":
> library(tidyverse)
> td %>%
group_by(Type) %>%
summarise(MinValue1 = min(Value1),
MeanValue1 = mean(Value1),
MeanValue2 = mean(Value2))
# A tibble: 3 × 4
Type MinValue1 MeanValue1 MeanValue2
<fctr> <int> <dbl> <dbl>
1 A 5 6.25 2.5
2 B 5 6.00 4.0
3 C 8 8.00 13.0
Note that we have not Column "MinValue2" here. Also note that "summarise(..., MinValue2 = min(Value2), ...)" does not work since this solution takes the minimum of all observations of one type.
We may play with "slice" and then merge the results:
> td %>% group_by(Type) %>% slice(which.min(Value1))
Source: local data frame [3 x 4]
Groups: Type [3]
Type Rep Value1 Value2
<fctr> <int> <int> <int>
1 A 3 5 4
2 B 1 5 10
3 C 1 8 13
but note that the "slice" tool does not help us here: "Type A, Value1 5" should have "Value2" == 3, not == 4 as slice returns.
So, have you guys an elegant way to achieve the results I seek? Thanks!
Upvotes: 3
Views: 900
Reputation: 693
Many thanks @evgeniC and @akrun. Your help is valuable. To my purposes/data set, both solutions work very well. So, to enrich a little bit the discussion, I run some experiments to test how fast these proposals are, using the following script (and of course, comment/uncomment on each experiment):
library(tidyverse)
args <- commandArgs(TRUE)
set.seed(args[1])
n = args[2]
td = data.frame(Type = sample(LETTERS, n, replace=T),
Value1 = sample(1:100, n, replace=T),
Value2 = sample(1:100, n, replace=T))
ptm <- proc.time()
# Solution 1 ###
#get_min_at_min <- function(vec1, vec2) {
#return(vec2[order(vec1, vec2)[1]])
#}
#tmp <- td %>%
#group_by(Type) %>%
#summarise(MinValue1 = min(Value1),
#MinValue2 = get_min_at_min(Value1, Value2),
#MeanValue1 = mean(Value1),
#MeanValue2 = mean(Value2))
### Solution 2 ###
tmp <- td %>%
group_by(Type) %>%
summarise(MinValue1 = min(Value1),
MinValue2 = min(Value2[Value1 == MinValue1]),
MeanValue1 = mean(Value1),
MeanValue2 = mean(Value2))
### Solution 3 ###
#tmp <- td %>%
#group_by(Type) %>%
#group_by(MinValue2 = min(Value2[Value1==min(Value1)]), add=TRUE) %>%
#summarise_each(funs(min, mean), Value1:Value2) %>%
#select(-Value2_min)
print(proc.time() - ptm)
and for each algorithm, I run
$ Rscript test.R 270001 10000000
Using
> td %>% group_by(Alg) %>% summarise_each(funs(mean, sd), User:Elapsed)
we got the follwing results:
Alg User_mean System_mean Elapsed_mean User_sd System_sd Elapsed_sd
1 akrun 1.3643333 0.13766667 1.510333 0.01069268 0.005033223 0.02050203
2 evgeniC1 0.8706667 0.07466667 0.951000 0.03323151 0.003055050 0.04073082
3 evgeniC2 0.8600000 0.09300000 0.958000 0.05546170 0.005196152 0.06331666
Therefore, I am inclined to use Solution 2 from @evgeniC since it is the most elegant/simple and it is as fast as Solution 1. @akrun presented a nice solution but it is a little bit more complicated and slow. Anyway, the setup can be useful in other situations.
Upvotes: 1
Reputation: 886938
After grouping by 'Type', create another group with the minimum
of 'Value2' based on selecting the elements that corresponds to the minimum of 'Value1', use summarise_each
to get the min
and mean
of the selected columns ('Value1' and 'Value2') and remove the 'Value2_min' with select
td %>%
group_by(Type) %>%
group_by(MinValue2 = min(Value2[Value1==min(Value1)]), add=TRUE) %>%
summarise_each(funs(min, mean), Value1:Value2) %>%
select(-Value2_min)
Upvotes: 1
Reputation: 1181
One way to do it is to use property of order
function to break ties with another vector:
get_min_at_min <- function(vec1, vec2) {
return(vec2[order(vec1, vec2)[1]])
}
This returns minimum value of vec2
among those which indices correspond to minimum value of vec1
. With this function piping is straightforward:
td %>%
group_by(Type) %>%
summarise(MinValue1 = min(Value1),
MinValue2 = get_min_at_min(Value1, Value2),
MeanValue1 = mean(Value1),
MeanValue2 = mean(Value2))
Or just simply use the fact that one can address just computed variables inside dplyr
functions:
td %>%
group_by(Type) %>%
summarise(MinValue1 = min(Value1),
MinValue2 = min(Value2[Value1 == MinValue1]),
MeanValue1 = mean(Value1),
MeanValue2 = mean(Value2))
Upvotes: 2