Ron

Reputation: 25

statistics for 2x4 contingency table with both large and small counts

I apologize if this is a very naive question...

I have 7000 2x4 contingency tables of count data. Each table represents one position in a genome and records the number of times each DNA nucleotide is observed at that position in 2 different environments. An example contingency table would be

            A      C      G      T 
condition1  0      2      20     70000
condition2  3      15     0      95000

or
            A      C     G       T 
condition1  80146  0     5       0
condition2  26821  2     4       0

The counts are non-negative integers: the minimum is 0 and the maximum can reach ~800,000. One cell generally holds nearly all of the counts in its row, and it is the same nucleotide in both conditions (for example, T in the first table above and A in the second). One or two other cells then have low counts, and it is in these cells that the difference, if any, should be observed.

The goal is to identify the positions that differ significantly between these 2 environmental conditions, so that they can be analyzed further. Our measurement method is estimated to have an error rate of 10^-6.

I am using R to analyze this data. I am not sure I can run a chi-square test on it, because some cells have small or 0 counts. With Fisher's exact test I get 2 errors:

with a workspace of 1E5 
FEXACT error 40.
Out of workspace.

with a workspace of >3E5
FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.

Can anyone suggest an appropriate test, or suitable settings for the Fisher or chi-square test?

Many thanks in advance,

Ron

Upvotes: 2

Views: 2528

Answers (2)

Kai Sun

Reputation: 1

Fisher's exact test in R only works on tables with smaller counts. If you reduce the counts in column T from 70000 and 95000 to 700 and 950, the Fisher test will run.

Meanwhile, I tried chisq.test on your data and it ran. For tables with large counts, the chi-square test is preferred over Fisher's exact test.
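
As a minimal sketch of how these two options could be checked on the first table from the question: the code below prints the expected counts that chisq.test computes, and then runs fisher.test with its simulate.p.value option, which estimates the p-value by Monte Carlo sampling rather than by the exact algorithm. The object names (tab, expected, fisher_mc), the seed and B = 10000 are illustrative assumptions, not something from the question.

```r
# First example table from the question
# (rows: conditions, columns: nucleotides)
tab <- matrix(c(0,  2, 20, 70000,
                3, 15,  0, 95000),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("condition1", "condition2"),
                              c("A", "C", "G", "T")))

# Expected counts under independence of condition and nucleotide.
# suppressWarnings() hides the small-expected-count warning here,
# because we only want to inspect the expected values.
expected <- suppressWarnings(chisq.test(tab))$expected
print(round(expected, 2))

# Fisher's test with a simulated (Monte Carlo) p-value.
# B is the number of simulated tables; 10000 is an arbitrary choice.
set.seed(1)
fisher_mc <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
print(fisher_mc$p.value)
```

A simulated p-value cannot be smaller than 1/(B + 1), so B limits how small a p-value this approach can report.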

Upvotes: 0

rnso

Reputation: 24613

The chi-square test works:

df1 <- data.frame(A = c(0L, 3L), C = c(2L, 15L), G = c(20L, 0L),
                  T = c(70000L, 95000L))

df1
  A  C  G     T
1 0  2 20 70000
2 3 15  0 95000

chisq.test(df1)

        Pearson's Chi-squared test

data:  df1
X-squared = 35.8943, df = 3, p-value = 7.884e-08

Warning message:
In chisq.test(df1) : Chi-squared approximation may be incorrect

I am not sure if this is sufficient, given the warning that the chi-squared approximation may be incorrect.
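
One way to sidestep that warning, sketched here under stated assumptions, is to let chisq.test simulate the p-value (simulate.p.value = TRUE) instead of relying on the chi-squared approximation. The helper name mc_chisq_p, the list tables, the seed and B = 10000 are hypothetical choices for illustration. Columns that are zero in both conditions (such as T in the second example table) are dropped first, because chisq.test cannot simulate a p-value when a marginal total is zero.

```r
# The two example tables from the question
nucleotides <- c("A", "C", "G", "T")
conditions  <- c("condition1", "condition2")
tables <- list(
  pos1 = matrix(c(0,  2, 20, 70000,
                  3, 15,  0, 95000),
                nrow = 2, byrow = TRUE,
                dimnames = list(conditions, nucleotides)),
  pos2 = matrix(c(80146, 0, 5, 0,
                  26821, 2, 4, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(conditions, nucleotides))
)

# Hypothetical helper: Monte Carlo chi-square p-value for one table
mc_chisq_p <- function(tab, B = 10000) {
  # Drop nucleotides never observed in either condition
  tab <- tab[, colSums(tab) > 0, drop = FALSE]
  chisq.test(tab, simulate.p.value = TRUE, B = B)$p.value
}

set.seed(42)
p_values <- sapply(tables, mc_chisq_p)
print(p_values)
```

The same helper could be applied to a list of all 7000 tables. The simulated p-values are bounded below by 1/(B + 1), so a larger B is needed to resolve very small p-values.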

Upvotes: 0
