akraf
akraf

Reputation: 3235

Create multi-row summaries with dplyr, for example grouped table()

My data set consists of observations of plants of multiple varieties (genotypes) which I have for multiple environments. This example data contains one line for each observation I have. It lists the plant genotype and in which environment the observation was made.

nObs <- 500
data <- tibble(Gt  = sample(sprintf("%02d",1:20),nObs,T),
               Env = sample(LETTERS\[1:8\],nObs,T)
               # The measured plant trait would also go here but is irrelevant for now.
               )

What I want is a table which shows me for each genotype in how many environments it is repeated how many times. For example, If my genotype 01 has two replications in five environments but only one in another two environments, the output I expect is:

Genotype  nRepl  nEnv
---------------------
 A        2      5
 A        1      2

I have found a possibility using do, but it is slow because do expects a data frame to be created for each data subgroup which contains the summary information. This creates a lot of overhead, as you can see when profiling:

profvis::profvis(for(1:10) idea1())  #idea1 is defined below

profiler output

Is there another approach, perhaps without using do(.), which gives this result?

My ideas up to now are shown below:


This returns the number of observations I have for each genotype in each environment (=replications):

data %>% count(Gt, Env) 

## # A tibble: 158 x 3
##    Gt    Env       n
##    <chr> <chr> <int>
##  1 01    A         2
##  2 01    B         1
##  3 01    C         2
##  4 01    D         2
##  5 01    E         6
##  6 01    F         4
##  7 01    G         5
##  8 01    H         3
##  9 02    A         1
## 10 02    B         2
## # ... with 148 more rows

These are the three approaches I could come up with so far. All use do:

idea1 <- function(){
  data %>% count(Gt, Env) %>% group_by(Gt) %>%
    do({
      t <- table(.$n)
      tibble(nRpl = names(t), nEnv = as.integer(t))
    })
}

idea2 <- function(){
  data %>% count(Gt, Env) %>% group_by(Gt) %>%
    do({
      t <- table(.$n)
      data.frame(nRpl = names(t), nEnv = as.integer(t), stringsAsFactors = FALSE)
    })
}

idea3 <- function(){
  data %>% count(Gt, Env) %>% group_by(Gt) %>%
    do({
      t <- table(.$n)
      data.frame(nRpl = names(t), nEnv = as.integer(t), stringsAsFactors = FALSE,
                 check.names = FALSE, check.rows = FALSE)
    })
}

This is the output I want:

idea1()

## # A tibble: 93 x 3
## # Groups:   Gt [20]
##    Gt    nRpl   nEnv
##    <chr> <chr> <int>
##  1 01    1         1
##  2 01    2         3
##  3 01    3         1
##  4 01    4         1
##  5 01    5         1
##  6 01    6         1
##  7 02    1         1
##  8 02    2         4
##  9 02    4         1
## 10 02    6         1
## # ... with 83 more rows

Speed comparison:

library(microbenchmark)
microbenchmark({idea1()},{idea3()},{idea4()}, times = 20)

## Unit: milliseconds
##             expr       min        lq     mean   median       uq      max
##  {     idea1() } 299.21640 373.87930 435.2567 427.8721 460.7488 693.9279
##  {     idea3() }  80.03300 100.25709 144.8413 124.8140 182.9211 300.3150
##  {     idea4() }  83.04371  98.18572 122.6464 108.1945 141.0208 216.3999
##  neval
##     20
##     20
##     20

Upvotes: 0

Views: 41

Answers (1)

alistaire
alistaire

Reputation: 43334

You can use count twice, the second time counting the number of replications calculated the first time:

library(dplyr)
set.seed(47)    # for sampling reproducibility

data <- tibble(Gt  = sample(sprintf("%02d", 1:20), 500, replace = TRUE),
               Env = sample(LETTERS[1:8], 500, replace = TRUE))

data %>% 
    count(Gt, Env) %>% 
    count(Gt, n) %>% 
    rename(nRpl = n, nEnv = nn)    # make better names
#> # A tibble: 85 x 3
#>    Gt     nRpl  nEnv
#>    <chr> <int> <int>
#>  1 01        2     1
#>  2 01        3     4
#>  3 01        4     3
#>  4 02        2     1
#>  5 02        3     1
#>  6 02        4     3
#>  7 02        6     1
#>  8 03        1     3
#>  9 03        2     1
#> 10 03        3     1
#> # ... with 75 more rows

Upvotes: 1

Related Questions