an_drade
an_drade

Reputation: 693

Given the minimum of column, find the minimum in other colunm (dplyr)

Suppose we have the following dataset in R:

> td
  Type Rep Value1 Value2
1    A   1      7      1
2    A   2      5      4
3    A   3      5      3
4    A   4      8      2
5    B   1      5     10
6    B   2      6      1
7    B   3      7      1
8    C   1      8     13
9    C   2      8     13

> td <- structure(list(Type = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Rep = c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L), Value1 = c(7L, 5L, 5L, 8L, 5L, 
6L, 7L, 8L, 8L), Value2 = c(1L, 4L, 3L, 2L, 10L, 1L, 1L, 13L, 
13L)), .Names = c("Type", "Rep", "Value1", "Value2"), class = "data.frame",
row.names = c(NA, -9L))

I would like to produce the following table:

  Type MinValue1 MinValue2 MeanValue1 MeanValue2
1    A         5         3       6.25        2.5
2    B         5        10       6.00        4.0
3    C         3        13       8.00       13.0

In this table, the data is summarized by "Type." Column "MinValue1" is the minimal value for a particular type and column "MinValue2" is the minimal value of "Value2", given the minimal value(s) of column "Value1". Columns Mean* are the general average of all observations.

One way to do this is to implement loops that iterate over each type and do the math. However, I am looking for a better/easy/beautiful way to perform such operation.

I have played with tools from "tidyverse":

> library(tidyverse)
> td %>% 
     group_by(Type) %>% 
     summarise(MinValue1 = min(Value1), 
               MeanValue1 = mean(Value1),
               MeanValue2 = mean(Value2))
# A tibble: 3 × 4
    Type MinValue1 MeanValue1 MeanValue2
  <fctr>    <int>       <dbl>      <dbl>
1      A        5        6.25        2.5
2      B        5        6.00        4.0
3      C        8        8.00       13.0

Note that we have not Column "MinValue2" here. Also note that "summarise(..., MinValue2 = min(Value2), ...)" does not work since this solution takes the minimum of all observations of one type.

We may play with "slice" and then merge the results:

> td %>% group_by(Type) %>% slice(which.min(Value1))
Source: local data frame [3 x 4]
Groups: Type [3]

    Type   Rep Value1 Value2
  <fctr> <int>  <int>  <int>
1      A     3      5      4
2      B     1      5     10
3      C     1      8     13

but note that the "slice" tool does not help us here: "Type A, Value1 5" should have "Value2" == 3, not == 4 as slice returns.

So, have you guys an elegant way to achieve the results I seek? Thanks!

Upvotes: 3

Views: 900

Answers (3)

an_drade
an_drade

Reputation: 693

Many thanks @evgeniC and @akrun. Your help is valuable. To my purposes/data set, both solutions work very well. So, to enrich a little bit the discussion, I run some experiments to test how fast these proposals are, using the following script (and of course, comment/uncomment on each experiment):

library(tidyverse)

args <- commandArgs(TRUE)
set.seed(args[1])
n = args[2]

td = data.frame(Type = sample(LETTERS, n, replace=T),
                Value1 = sample(1:100, n, replace=T),
                Value2 = sample(1:100, n, replace=T))

ptm <- proc.time()

# Solution 1 ###
#get_min_at_min <- function(vec1, vec2) {
  #return(vec2[order(vec1, vec2)[1]])
#}

#tmp <- td %>%
       #group_by(Type) %>%
       #summarise(MinValue1 = min(Value1),
                 #MinValue2 = get_min_at_min(Value1, Value2),
                 #MeanValue1 = mean(Value1),
                 #MeanValue2 = mean(Value2))

### Solution 2 ###
tmp <- td %>%
       group_by(Type) %>%
       summarise(MinValue1 = min(Value1),
                 MinValue2 = min(Value2[Value1 == MinValue1]),
                 MeanValue1 = mean(Value1),
                 MeanValue2 = mean(Value2))

### Solution 3 ###
#tmp <- td %>%
       #group_by(Type) %>%
       #group_by(MinValue2 = min(Value2[Value1==min(Value1)]), add=TRUE) %>%
       #summarise_each(funs(min, mean), Value1:Value2) %>%
       #select(-Value2_min)

print(proc.time() - ptm)

and for each algorithm, I run

$ Rscript test.R 270001 10000000

Using

> td %>% group_by(Alg) %>% summarise_each(funs(mean, sd), User:Elapsed)

we got the follwing results:

       Alg User_mean System_mean Elapsed_mean    User_sd   System_sd Elapsed_sd
1    akrun 1.3643333  0.13766667     1.510333 0.01069268 0.005033223 0.02050203
2 evgeniC1 0.8706667  0.07466667     0.951000 0.03323151 0.003055050 0.04073082
3 evgeniC2 0.8600000  0.09300000     0.958000 0.05546170 0.005196152 0.06331666

Therefore, I am inclined to use Solution 2 from @evgeniC since it is the most elegant/simple and it is as fast as Solution 1. @akrun presented a nice solution but it is a little bit more complicated and slow. Anyway, the setup can be useful in other situations.

Upvotes: 1

akrun
akrun

Reputation: 886938

After grouping by 'Type', create another group with the minimum of 'Value2' based on selecting the elements that corresponds to the minimum of 'Value1', use summarise_each to get the min and mean of the selected columns ('Value1' and 'Value2') and remove the 'Value2_min' with select

td %>% 
    group_by(Type) %>%
    group_by(MinValue2 = min(Value2[Value1==min(Value1)]), add=TRUE) %>%  
    summarise_each(funs(min, mean), Value1:Value2) %>% 
    select(-Value2_min)

Upvotes: 1

echasnovski
echasnovski

Reputation: 1181

One way to do it is to use property of order function to break ties with another vector:

get_min_at_min <- function(vec1, vec2) {
  return(vec2[order(vec1, vec2)[1]])
}

This returns minimum value of vec2 among those which indices correspond to minimum value of vec1. With this function piping is straightforward:

td %>% 
  group_by(Type) %>% 
  summarise(MinValue1 = min(Value1),
            MinValue2 = get_min_at_min(Value1, Value2),
            MeanValue1 = mean(Value1),
            MeanValue2 = mean(Value2))

Or just simply use the fact that one can address just computed variables inside dplyr functions:

td %>% 
  group_by(Type) %>% 
  summarise(MinValue1 = min(Value1),
            MinValue2 = min(Value2[Value1 == MinValue1]),
            MeanValue1 = mean(Value1),
            MeanValue2 = mean(Value2))

Upvotes: 2

Related Questions