How to create a contingency tables (crosstab) for two categorical variables?

Question

The WHO DownSyndrome database gives you information about the number of down syndrome babies born in 1M babies for different variables.

I wanted to create a cross tab with the absolute, relative, and marginal frequencies for the categorical variables Country and Year

The desire results would look something like:


Year    |      1978           1979            1980      
Country |   ----------------------------------------------------------------------     
ALB     |      76.4             78.5              87.6            ...
AUT     |      87.6             67.9              80.0            …
AZE     |      90.5             78.5              64.8            …

I tried a couple of things:

First, creating a new df for the three variables:

DS <- dplyr::select(DownSyndrome, YEAR, COUNTRY_REGION, VALUE)

But I can´t figure out how to use the table, and cross functions with that.

I managed to get the absolute frequencies with:


DS_wide <- DownSyndrome %>%
  pivot_wider(
    names_from = YEAR,
    values_from = VALUE)

But then again, I dont know how to get the relative and margin frequencies from here.

Any help is appreciated, Thanks

PS:

Link to the database

Ian Campbell · Accepted Answer

Here's a dplyr approach.

The first line converts the long form to wide form based on your code. The second line calculates the totals based on all of the columns with names that match the pattern of only numbers. The third line calculates the relative total for each country. The fourth line divides all of the columns with only numbers by the total column and assigns them to columns with a suffix of _RELATIVE.

library(tidyverse)
DownSyndrome %>%
  pivot_wider(names_from = YEAR,values_from = VALUE) %>%
  mutate(TOTAL = rowSums(.[grep("[0-9]+",names(.))],na.rm=TRUE)) %>%
  mutate(TOTAL_RELATIVE = TOTAL / sum(TOTAL)) %>%
  mutate_at(vars(matches("[0-9]+")), list(RELATIVE = ~ ./TOTAL))
# A tibble: 49 x 104
#   COUNTRY COUNTRY_GRP SEX   `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017` `1985` `1986` `1987` `1988`
#                                         
# 1 ALB     ""          ALL       32     29     42     39     18     28     14     29     23     16     NA     NA     NA     NA
# 2 AUT     ""          ALL       13     16     12      6      6      3     NA     NA     NA     NA     23     29     39     24
# 3 AZE     ""          ALL       38     51     52     42     32     38     49     36     32     38     NA     NA     NA     NA
# 4 BEL     ""          ALL      248    232    275    240    259    245     NA     NA     NA     NA     56    126    152    126
# 5 BGR     ""          ALL       22     32     27     32     38     17     32     32     18     24     NA     NA     NA     NA
# 6 BIH     ""          ALL       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     59     42     75     65
# 7 BLR     ""          ALL       69     60     44     61     46     50     56     53     63     56     NA     NA     NA     NA
# 8 CHE     ""          ALL      121    107    127    145    152    152    131    108    112     86     NA     NA     NA     NA
# 9 CZE     ""          ALL       43     49     46     53     49     45     45     43     NA     NA     NA     NA     NA    116
#10 DEU     ""          ALL      269    288    251    290    298    284    290    296    308    312     NA     NA     NA     NA

If you only want the relative values, you could use select.

DownSyndrome %>%
  pivot_wider(names_from = YEAR,values_from = VALUE) %>%
  mutate(TOTAL = rowSums(.[grep("[0-9]+",names(.))],na.rm=TRUE)) %>%
  mutate(TOTAL_RELATIVE = TOTAL / sum(TOTAL)) %>%
  mutate_at(vars(matches("[0-9]+")), list(RELATIVE = ~ ./TOTAL)) %>%
  select(COUNTRY,matches("RELATIVE"),TOTAL_RELATIVE)

Data

DownSyndrome <- read.csv("~/HFA_604_EN.csv",skip = 25)

How to create a contingency tables (crosstab) for two categorical variables?

Answers (1)

Related Questions