Reputation: 3235
My data set consists of observations of plants of multiple varieties (genotypes) which I have for multiple environments. This example data contains one line for each observation I have. It lists the plant genotype and in which environment the observation was made.
nObs <- 500
data <- tibble(Gt = sample(sprintf("%02d",1:20),nObs,T),
Env = sample(LETTERS\[1:8\],nObs,T)
# The measured plant trait would also go here but is irrelevant for now.
)
What I want is a table which shows me for each genotype in how many environments
it is repeated how many times. For example, If my genotype 01
has two replications
in five environments but only one in another two environments, the output
I expect is:
Genotype nRepl nEnv
---------------------
A 2 5
A 1 2
I have found a possibility using do
, but it is slow because do
expects
a data frame to be created for each data subgroup which contains the summary information. This creates a lot of
overhead, as you can see when profiling:
profvis::profvis(for(1:10) idea1()) #idea1 is defined below
Is there another approach, perhaps without using do(.)
, which gives this result?
My ideas up to now are shown below:
This returns the number of observations I have for each genotype in each environment (=replications):
data %>% count(Gt, Env)
## # A tibble: 158 x 3
## Gt Env n
## <chr> <chr> <int>
## 1 01 A 2
## 2 01 B 1
## 3 01 C 2
## 4 01 D 2
## 5 01 E 6
## 6 01 F 4
## 7 01 G 5
## 8 01 H 3
## 9 02 A 1
## 10 02 B 2
## # ... with 148 more rows
These are the three approaches I could come up with so far. All use do
:
idea1 <- function(){
data %>% count(Gt, Env) %>% group_by(Gt) %>%
do({
t <- table(.$n)
tibble(nRpl = names(t), nEnv = as.integer(t))
})
}
idea2 <- function(){
data %>% count(Gt, Env) %>% group_by(Gt) %>%
do({
t <- table(.$n)
data.frame(nRpl = names(t), nEnv = as.integer(t), stringsAsFactors = FALSE)
})
}
idea3 <- function(){
data %>% count(Gt, Env) %>% group_by(Gt) %>%
do({
t <- table(.$n)
data.frame(nRpl = names(t), nEnv = as.integer(t), stringsAsFactors = FALSE,
check.names = FALSE, check.rows = FALSE)
})
}
This is the output I want:
idea1()
## # A tibble: 93 x 3
## # Groups: Gt [20]
## Gt nRpl nEnv
## <chr> <chr> <int>
## 1 01 1 1
## 2 01 2 3
## 3 01 3 1
## 4 01 4 1
## 5 01 5 1
## 6 01 6 1
## 7 02 1 1
## 8 02 2 4
## 9 02 4 1
## 10 02 6 1
## # ... with 83 more rows
Speed comparison:
library(microbenchmark)
microbenchmark({idea1()},{idea3()},{idea4()}, times = 20)
## Unit: milliseconds
## expr min lq mean median uq max
## { idea1() } 299.21640 373.87930 435.2567 427.8721 460.7488 693.9279
## { idea3() } 80.03300 100.25709 144.8413 124.8140 182.9211 300.3150
## { idea4() } 83.04371 98.18572 122.6464 108.1945 141.0208 216.3999
## neval
## 20
## 20
## 20
Upvotes: 0
Views: 41
Reputation: 43334
You can use count
twice, the second time counting the number of replications calculated the first time:
library(dplyr)
set.seed(47) # for sampling reproducibility
data <- tibble(Gt = sample(sprintf("%02d", 1:20), 500, replace = TRUE),
Env = sample(LETTERS[1:8], 500, replace = TRUE))
data %>%
count(Gt, Env) %>%
count(Gt, n) %>%
rename(nRpl = n, nEnv = nn) # make better names
#> # A tibble: 85 x 3
#> Gt nRpl nEnv
#> <chr> <int> <int>
#> 1 01 2 1
#> 2 01 3 4
#> 3 01 4 3
#> 4 02 2 1
#> 5 02 3 1
#> 6 02 4 3
#> 7 02 6 1
#> 8 03 1 3
#> 9 03 2 1
#> 10 03 3 1
#> # ... with 75 more rows
Upvotes: 1