Reputation: 21
I have problem with a t.test. I would like to study the differential expression of every of my protein in multiple case. To do this I wanted to do a t.test.
I build a dataset (just the head) :
Protein Norm Cultivar Strain CxS rep Treatment
1 TraesCS1A01G002200.1 8.955735 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
2 TraesCS1A01G005600.3 7.241294 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
3 TraesCS1A01G007400.1 7.009850 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
4 TraesCS1A01G007500.1 9.291952 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
5 TraesCS1A01G007700.1 7.988990 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
6 TraesCS1A01G009000.3 7.705927 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
The link of the full dataset is : https://github.com/sangui1216/protein-analysis It is the only csv.
I wanted to do a t.test on this data set, I did :
test%>%group_by(Protein) %>% do(tidy(t.test(Norm ~ Strain,data = .))) ->t_test_result
And at 48% I have this error message.
Error in t.test.default(x = c(7.64221451305547, 7.64221451305547, 7.64221451305547, :
data are essentially constant
Can you help me ? My data set doesn't look like constant to me (but I have to many data to look one by one). Can I force the t.test to just ignore this part ?
Thank you very much,
Sangui
Thanks to the answers ! It works.
Upvotes: 2
Views: 122
Reputation: 5429
It is indeed constant, have a look here:
library(data.table)
dat <- fread("https://github.com/sangui1216/protein-analysis/raw/main/test1.csv")
dat[ dat[, .I[var(Norm) == 0], by = Protein]$V1]
> dat[ dat[, .I[var(Norm) == 0], by = Protein]$V1]
V1 Protein Norm Cultivar Strain CxS rep Treatment
1: 2499 TraesCS4A01G239100.2 7.642215 Cadenza MDC_FG1 Cadenza_MDC_FG1 1 Fusa
2: 7657 TraesCS4A01G239100.2 7.642215 Cadenza MDC_FG1 Cadenza_MDC_FG1 2 Fusa
3: 12815 TraesCS4A01G239100.2 7.642215 Cadenza MDC_FG1 Cadenza_MDC_FG1 3 Fusa
4: 17973 TraesCS4A01G239100.2 7.642215 Cadenza control Cadenza_MDC_FU01 1 Fusa
5: 23131 TraesCS4A01G239100.2 7.642215 Cadenza control Cadenza_MDC_FU01 2 Fusa
6: 28289 TraesCS4A01G239100.2 7.642215 Cadenza control Cadenza_MDC_FU01 3 Fusa
7: 33447 TraesCS4A01G239100.2 7.642215 Cadenza control Cadenza_MDC_FG13 1 Fusa
8: 38605 TraesCS4A01G239100.2 7.642215 Cadenza control Cadenza_MDC_FG13 2 Fusa
9: 43763 TraesCS4A01G239100.2 7.642215 Cadenza control Cadenza_MDC_FG13 3 Fusa
10: 48921 TraesCS4A01G239100.2 7.642215 aRecital MDC_FG1 Recital_MDC_FG1 1 Fusa
11: 54079 TraesCS4A01G239100.2 7.642215 aRecital MDC_FG1 Recital_MDC_FG1 2 Fusa
12: 59237 TraesCS4A01G239100.2 7.642215 aRecital MDC_FG1 Recital_MDC_FG1 3 Fusa
13: 64395 TraesCS4A01G239100.2 7.642215 aRecital control Recital_MDC_FU01 1 Fusa
14: 69553 TraesCS4A01G239100.2 7.642215 aRecital control Recital_MDC_FU01 2 Fusa
15: 74711 TraesCS4A01G239100.2 7.642215 aRecital control Recital_MDC_FU01 3 Fusa
16: 79869 TraesCS4A01G239100.2 7.642215 aRecital control Recital_MDC_FG13 1 Fusa
17: 85027 TraesCS4A01G239100.2 7.642215 aRecital control Recital_MDC_FG13 2 Fusa
18: 90185 TraesCS4A01G239100.2 7.642215 aRecital control Recital_MDC_FG13 3 Fusa
19: 95343 TraesCS4A01G239100.2 7.642215 Renan MDC_FG1 Renan_MDC_FG1 1 Fusa
20: 100501 TraesCS4A01G239100.2 7.642215 Renan MDC_FG1 Renan_MDC_FG1 2 Fusa
21: 105659 TraesCS4A01G239100.2 7.642215 Renan MDC_FG1 Renan_MDC_FG1 3 Fusa
22: 110817 TraesCS4A01G239100.2 7.642215 Renan control Renan_MDC_FU01 1 Fusa
23: 115975 TraesCS4A01G239100.2 7.642215 Renan control Renan_MDC_FU01 2 Fusa
24: 121133 TraesCS4A01G239100.2 7.642215 Renan control Renan_MDC_FU01 3 Fusa
25: 126291 TraesCS4A01G239100.2 7.642215 Renan control Renan_MDC_FG13 1 Fusa
26: 131449 TraesCS4A01G239100.2 7.642215 Renan control Renan_MDC_FG13 2 Fusa
27: 136607 TraesCS4A01G239100.2 7.642215 Renan control Renan_MDC_FG13 3 Fusa
Just filter away that protein, and run again.
Using data.table notation you can easily chain upon this to run the t.tests too, should take only a few seconds:
library(data.table)
library(broom)
dat <- fread("https://github.com/sangui1216/protein-analysis/raw/main/test1.csv")
system.time( t.test.results <- dat[ dat[, .I[var(Norm) > 0], by = Protein]$V1 ][
, tidy( t.test( Norm ~ Strain ) ), by = Protein
] )
t.test.results[ order(p.value) ]
user system elapsed
2.922 0.000 2.833
Protein estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
1: TraesCS2A01G315600.1 -3.602881e-01 8.180318 8.540606 -8.856018988 2.590483e-08 19.75920 -0.4452173 -0.27535878 Welch Two Sample t-test two.sided
2: TraesCS1B01G107600.1 -2.154759e-01 7.944446 8.159922 -7.939083661 2.919241e-08 24.72827 -0.2714053 -0.15954648 Welch Two Sample t-test two.sided
3: TraesCS2B01G333900.1 -4.974910e-01 7.755210 8.252701 -7.648981619 1.725526e-07 20.91120 -0.6327845 -0.36219763 Welch Two Sample t-test two.sided
4: TraesCS6D01G302300.2 -3.101279e-01 8.955777 9.265905 -7.319162710 3.734485e-07 20.60395 -0.3983486 -0.22190713 Welch Two Sample t-test two.sided
5: TraesCS1B01G090600.1 -2.073122e-01 9.016683 9.223996 -7.629576322 8.048855e-07 16.60394 -0.2647448 -0.14987957 Welch Two Sample t-test two.sided
---
5153: TraesCS1A01G250300.1 1.772189e-04 8.506596 8.506419 0.003487947 9.972537e-01 18.83739 -0.1062293 0.10658371 Welch Two Sample t-test two.sided
5154: TraesCS4A01G042600.1 -7.410061e-04 7.107547 7.108288 -0.003026521 9.976295e-01 13.47675 -0.5277846 0.52630255 Welch Two Sample t-test two.sided
5155: TraesCS3A01G275700.2 -1.007336e-04 8.006658 8.006758 -0.002819725 9.977826e-01 17.24193 -0.0753926 0.07519114 Welch Two Sample t-test two.sided
5156: TraesCS4B01G293000.1 2.161934e-04 7.310060 7.309844 0.001284415 9.989894e-01 17.90796 -0.3535424 0.35397481 Welch Two Sample t-test two.sided
5157: TraesCS1B01G059800.2 -7.504112e-05 7.212239 7.212314 -0.001240131 9.990325e-01 11.09887 -0.1331134 0.13296331 Welch Two Sample t-test two.sided
Upvotes: 1
Reputation: 76621
Here is a way of running the code until the end with tryCatch
.
library(tidyverse)
library(generics)
test %>%
group_by(Protein) %>%
do(tidy(
tryCatch(
t.test(Norm ~ Strain, data = .),
error = function(e) NA
)
)) -> t_test_result
head(t_test_result)
## A tibble: 6 x 12
## Groups: Protein [6]
# Protein estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative x
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <lgl>
#1 TraesCS1A01G0… -0.0285 8.70 8.73 -0.302 0.767 14.8 -0.230 0.173 Welch Two Sa… two.sided NA
#2 TraesCS1A01G0… 0.0429 7.04 7.00 0.307 0.765 9.78 -0.270 0.356 Welch Two Sa… two.sided NA
#3 TraesCS1A01G0… -0.188 5.84 6.02 -0.444 0.663 17.1 -1.08 0.705 Welch Two Sa… two.sided NA
#4 TraesCS1A01G0… 0.0476 9.33 9.29 1.96 0.0616 24.9 -0.00250 0.0978 Welch Two Sa… two.sided NA
#5 TraesCS1A01G0… 0.209 8.09 7.88 1.62 0.137 10.2 -0.0787 0.498 Welch Two Sa… two.sided NA
#6 TraesCS1A01G0… 0.0673 8.05 7.99 1.00 0.334 12.3 -0.0784 0.213 Welch Two Sa… two.sided NA
To know what is the Protein
with all values equal, run
which(tapply(test$Norm, test$Protein, var) == 0)
#TraesCS4A01G239100.2
# 2499
Upvotes: 0