striatum

Reputation: 1598

How to create 2-way contingency tables in R

(Disclaimer: With apologies, I did ask a similar question previously. It did not get any answer, and it was closed.)

I would like to create 2-way contingency tables of different sizes, say, from 3x3 to 10x15, where some show a significant association (using chisq.test() or similar) and some do not. I have run into bits and pieces of potentially relevant posts, but I do not see how to connect all the dots. For example, there is this post that discusses how to create random 2-way tables with r2dtable(). There are also posts about generating random integers that sum to a particular value, here and here, which could be useful for defining the row and column marginals for r2dtable().

Nevertheless, it escapes me how to generate a list of such tables. Also, it seems that r2dtable() always returns tables that show no association. I suppose this is to be expected, given that the tables are random.

Can anyone help, please?

Upvotes: 0

Views: 576

Answers (1)

Limey

Reputation: 12461

The missing piece of information in your question is how to define the association, or lack of association, in your tables. That is going to be a case-specific part of any generic solution.

I assume that the "table" you want to end up analysing consists of summarised data, classified by two factors.

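library(tidyverse)  # provides tibble(), expand(), pivot_wider(), mutate() and %>%

generateRawData <- function(nRow, nCol, f, ...) {
  # Long tibble with one row per (Row, Col) cell
  df <- tibble() %>% 
          expand(
            Row=1:nRow,
            Col=1:nCol
          )
  # f adds the Value column; the result is then reshaped to one column per Col
  df <- df %>% 
          f(...)  %>% 
          pivot_wider(
            names_from=Col,
            values_from=Value,
            names_prefix="Col"
          )
  return(df)
}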

Here, nCol and nRow have the obvious meanings, and f is a function that you must define: it populates a column named Value in a long tibble with columns named Row and Col. The ellipsis, ..., allows you to pass arbitrary additional arguments to f if needed.
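As an illustration of passing extra arguments through the ellipsis, here is a minimal sketch. The cell function poissonCells and its lambda argument are hypothetical names, not part of the code above, and the generator is repeated in a compact form (using expand_grid()) only so that the snippet runs on its own.

```r
library(tidyverse)

# Compact stand-in for the generator above, repeated so this sketch is self-contained
generateRawData <- function(nRow, nCol, f, ...) {
  expand_grid(Row = seq_len(nRow), Col = seq_len(nCol)) %>%
    f(...) %>%
    pivot_wider(names_from = Col, values_from = Value, names_prefix = "Col")
}

# Hypothetical cell function: independent Poisson counts with a common mean.
# lambda arrives via the ... of generateRawData()
poissonCells <- function(df, lambda = 10, ...) {
  df %>% mutate(Value = rpois(n(), lambda = lambda))
}

# lambda = 20 is forwarded to poissonCells() through ...
x <- generateRawData(4, 3, poissonCells, lambda = 20)
x
```

Any argument that generateRawData() does not recognise itself is handed on to f, so each cell function can have its own tuning parameters.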

To generate a table with no association between rows and columns, simply fill Value with random data. For example:

randomCells <- function(df, ...){
  df %>% mutate(Value=5 + floor(runif(df %>% nrow(), max=10)))
}

So that

x <- generateRawData(3, 5, randomCells)
x
# A tibble: 3 x 6
    Row  Col1  Col2  Col3  Col4  Col5
  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     9    11    11    14     8
2     2    13    11    12     5    14
3     3     8    11    14    13    10

and

chisq.test(as.matrix(x))

    Pearson's Chi-squared test

data:  as.matrix(x)
X-squared = 8.8907, df = 10, p-value = 0.5425
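One thing to watch in the call above: as.matrix(x) keeps the Row index column, so the test is run on a 3 x 6 matrix. That is why the output reports df = 10 rather than the (3 - 1) * (5 - 1) = 8 you would expect for a 3 x 5 table. A minimal sketch of the test on the counts alone, with the values retyped from the printed tibble (the name counts is just illustrative):

```r
# Counts from the 3 x 5 example above, without the Row index column
counts <- matrix(
  c( 9, 11, 11, 14,  8,
    13, 11, 12,  5, 14,
     8, 11, 14, 13, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Row", 1:3), paste0("Col", 1:5))
)

result <- chisq.test(counts)
result$parameter  # degrees of freedom: (3 - 1) * (5 - 1) = 8
result$p.value
```

With the tibble itself, as.matrix(x[, -1]) should give the same matrix of counts.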

Now suppose you want a linear trend across columns, but no association between rows:

linearColumns <- function(df, ...){
  df %>% mutate(Value=4*Col + floor(runif(df %>% nrow(), max=25)))
}

x <- generateRawData(3, 5, linearColumns)
x
# A tibble: 3 x 6
    Row  Col1  Col2  Col3  Col4  Col5
  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    20    12    31    24    46
2     2     9    25    39    46    38
3     3     6    20    35    36    49

giving

chisq.test(as.matrix(x))

    Pearson's Chi-squared test

data:  as.matrix(x)
X-squared = 22.63, df = 10, p-value = 0.0122

You just need to define f to give the pattern you want. In more complicated cases, it might be easier to define the response at the level of the experimental unit and then aggregate the observed data to form your summary data.
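A rough sketch of that unit-level approach in base R, which also produces a list of tables of different sizes. Everything here is an assumption made for illustration: the helper simulateUnits, its strength parameter, the exponential weighting, the sample size and the seed are not from the code above. Each simulated unit gets a row category and then a column category whose probabilities depend on the row when strength > 0.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: simulate n units and cross-tabulate them.
# strength = 0 gives independent rows and columns; larger values
# concentrate each row's units around a "matched" column.
simulateUnits <- function(n, nRow, nCol, strength = 0) {
  rowCat <- sample(nRow, n, replace = TRUE)
  colCat <- vapply(rowCat, function(r) {
    # Column position matched to this row (row 1 -> column 1, last row -> last column)
    target <- 1 + (r - 1) * (nCol - 1) / max(nRow - 1, 1)
    # Weights decay with distance from the matched column; flat when strength = 0
    w <- exp(-strength * abs(seq_len(nCol) - target))
    sample(nCol, 1, prob = w)
  }, integer(1))
  table(Row = factor(rowCat, levels = seq_len(nRow)),
        Col = factor(colCat, levels = seq_len(nCol)))
}

sizes <- list(c(3, 3), c(5, 8), c(10, 15))
strengths <- c(0, 0.5)  # 0 = no association, 0.5 = associated (arbitrary choice)

# Named list with one table per size and strength
tables <- list()
for (s in sizes) for (k in strengths) {
  name <- sprintf("%dx%d_strength%.1f", s[1], s[2], k)
  tables[[name]] <- simulateUnits(n = 3000, nRow = s[1], nCol = s[2], strength = k)
}

# p-value of the chi-squared test for each table
pValues <- sapply(tables, function(tab) chisq.test(tab)$p.value)
pValues
```

Because the tables are built from individual units, the total count is fixed at n and the strength of association is controlled by a single number. For large tables with a small n, chisq.test() may warn that the approximation is unreliable.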

Apologies, I forgot to set.seed() before generating my examples.

Upvotes: 1
