Joe
Joe

Reputation: 121

Create Multiple 2-dimensional Tables from Multiple Columns in R Using dplyr

I'm looking for an efficient way to create multiple 2-dimension tables from an R dataframe of chi-square statistics. The code below builds on this answer to a previous question of mine about getting chi-square stats by groups. Now I want to create tables from the output by group. Here's what I have so far using the hsbdemo data frame from the UCLA R site:

ml <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")

str(ml)

'data.frame':   200 obs. of  13 variables:
 $ id     : num  45 108 15 67 153 51 164 133 2 53 ...
 $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
 $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
 $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
 $ prog   : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...

ml %>% 
  dplyr::select(prog, ses, schtyp) %>%
  table() %>%
  apply(3, chisq.test, simulate.p.value = TRUE) %>%
  lapply(`[`, c(6,7,9)) %>%
  reshape2::melt() %>%
  tidyr::spread(key = L2, value = value) %>%
  dplyr::rename(SchoolType = L1) %>%
  dplyr::arrange(SchoolType, prog) %>%
  dplyr::select(-observed, -expected) %>%
  reshape2::acast(., prog ~ ses ~ SchoolType ) %>%
  tbl_df()  

The output after the last arrange statement produces this tibble (showing only the first five rows):

       prog    ses SchoolType expected observed     stdres
1   general    low    private  0.37500        2  3.0404678
2   general middle    private  3.56250        3 -0.5187244
3   general   high    private  2.06250        1 -1.0131777
4  academic    low    private  1.50000        0 -2.5298221
5  academic middle    private 14.25000       14 -0.2078097

It's easy to select one column, for example, stdres, and pass it to acast and tbl_df, which gets pretty much what I'm after:

# A tibble: 3 x 6
  low.private middle.private high.private low.public middle.public high.public
        <dbl>          <dbl>        <dbl>      <dbl>         <dbl>       <dbl>
1       3.04          -0.519        -1.01      1.47         -0.236       -1.18
2      -2.53          -0.208         1.50     -0.940        -2.06         3.21
3      -0.377          1.21         -1.06     -0.331         2.50        -2.45

Now I can repeat these steps for observed and expected frequencies and bind them by rows, but that seems inefficient. The output would observed frequencies stacked on expected, stacked on the standardized residuals. Something like this:

 low.private middle.private high.private low.public middle.public high.public
        <dbl>          <dbl>        <dbl>      <dbl>         <dbl>       <dbl>
1       2              3            1         14            17            8   
2       0             14           10         19            30           32   
3       0              2            0         12            29            7   
4       0.375          3.56         2.06      10.4          17.6         10.9 
5       1.5           14.2          8.25      21.7          36.6         22.7 
6       0.125          1.19         0.688     12.9          21.7         13.4 
7       3.04          -0.519       -1.01       1.47         -0.236       -1.18
8      -2.53          -0.208        1.50      -0.940        -2.06         3.21
9      -0.377          1.21        -1.06      -0.331         2.50        -2.45

Seems there ought to be a way to do this without repeating the code for each column, probably by creating and processing a list. Thanks in advance.

Upvotes: 0

Views: 838

Answers (1)

Ott Toomet
Ott Toomet

Reputation: 1956

Might this be the answer?

   ml1 <- ml %>% 
      dplyr::select(prog, ses, schtyp) %>%
      table() %>%
      apply(3, chisq.test, simulate.p.value = TRUE) %>%
      lapply(`[`, c(6,7,9)) %>%
      reshape2::melt()
   ml2 <- ml1 %>%
      dplyr::mutate(type=paste(ses, L1, sep=".")) %>%
      dplyr::select(-ses, -L1) %>%
      tidyr::spread(type, value)

This gives you

      prog       L2 high.private high.public low.private low.public middle.private middle.public
1  general expected     2.062500   10.910714   0.3750000 10.4464286      3.5625000    17.6428571
2  general observed     1.000000    8.000000   2.0000000 14.0000000      3.0000000    17.0000000
3  general   stdres    -1.013178   -1.184936   3.0404678  1.4663681     -0.5187244    -0.2360209
4 academic expected     8.250000   22.660714   1.5000000 21.6964286     14.2500000    36.6428571
5 academic observed    10.000000   32.000000   0.0000000 19.0000000     14.0000000    30.0000000
6 academic   stdres     1.504203    3.212431  -2.5298221 -0.9401386     -0.2078097    -2.0607058
7 vocation expected     0.687500   13.428571   0.1250000 12.8571429      1.1875000    21.7142857
8 vocation observed     0.000000    7.000000   0.0000000 12.0000000      2.0000000    29.0000000
9 vocation   stdres    -1.057100   -2.445826  -0.3771236 -0.3305575      1.2081594     2.4999085

I am not sure I understand completely what you are out after... But basically, create a new variable of SES and school type, and gather based on that. And obviously, reorder it as you wish :-)

Upvotes: 1

Related Questions