christinelly

Reputation: 63

R (tidyverse): Column sums for aggregated counts of 2 categorical variables for a chi-square test of independence?

Could someone please give me some advice?

  1. I want to add column totals to the table below.
  2. I need the table for a chi-square test of independence, so if there is a faster way, please enlighten me!

What is the best way to do this?

I tried colSums, but it gave me an error: Error in colSums(., mpaa_rating, na.rm = FALSE, dims = 1) : unused argument (mpaa_rating). I was evidently not using it correctly or not putting it in the right place. I tried colSums(mpaa_rating, na.rm = FALSE, dims = 1) %>% just above spread.

Many thanks in advance, Christine

library(tidyverse)  # provides tribble(), %>%, count(), spread()
movie_help<- data.frame(tribble(
             ~mpaa_rating,                       ~genre,
                     "PG",         "Action & Adventure",
                      "R",         "Mystery & Suspense",
                      "R",                      "Drama",
                      "R",                      "Drama",
                      "R",                      "Drama",
                     "PG",         "Action & Adventure",
                  "PG-13",                     "Comedy",
                      "R",                     "Comedy",
                      "R",         "Action & Adventure",
                      "R",                      "Drama",
                      "R",                      "Drama",
                      "G",                      "Drama",
                      "R",                     "Comedy",
                      "R",                      "Drama",
                      "R",         "Mystery & Suspense",
                      "R",  "Musical & Performing Arts",
                "Unrated",                      "Drama",
                      "R",                      "Drama",
                  "PG-13",                      "Drama",
                  "PG-13",                      "Drama"
             )) 
movie_help %>% 
filter(!is.na(genre), !is.na(mpaa_rating)) %>% 
count(genre, mpaa_rating) %>%
group_by(genre) %>%
mutate(prop = n) %>%
mutate(Total= sum(n)) %>%
select(-n) %>%
spread(key = mpaa_rating, value = prop) 
#> # A tibble: 5 x 7
#> # Groups:   genre [5]
#>                       genre Total     G    PG `PG-13`     R Unrated
#> *                     <chr> <int> <int> <int>   <int> <int>   <int>
#> 1        Action & Adventure     3    NA     2      NA     1      NA
#> 2                    Comedy     3    NA    NA       1     2      NA
#> 3                     Drama    11     1    NA       2     7       1
#> 4 Musical & Performing Arts     1    NA    NA      NA     1      NA
#> 5        Mystery & Suspense     2    NA    NA      NA     2      NA

Upvotes: 2

Views: 907

Answers (2)

Guilherme Marthe

Reputation: 1124

To get the sums at the bottom, I like to use the adorn_totals function from the janitor package. The janitor package has many small helper functions for cleaning up and formatting tables. Another favorite of mine is janitor::clean_names, which sanitizes column names uniformly.

Now in your case you can simply:

 movie_help %>% 
    filter(!is.na(genre), !is.na(mpaa_rating)) %>% 
    count(genre, mpaa_rating) %>% 
    group_by(genre) %>%
    mutate(prop = n) %>%
    mutate(Total= sum(n)) %>%  
    select(-n) %>%
    spread(key = mpaa_rating, value = prop, fill = 0) %>% 
    janitor::adorn_totals('row') %>% 
    janitor::clean_names() 
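
Since you asked about a faster way: as a rough sketch, and assuming a janitor version that has tabyl, the count and spread steps can probably be collapsed into a single cross-tabulation, with adorn_totals adding both the bottom row and the Total column. The data frame below is just the 20 rows from the question rebuilt without tidyverse, and the names crosstab and totals are mine, purely for illustration.

```r
library(janitor)

# The same 20 rows as movie_help in the question.
movie_help <- data.frame(
  mpaa_rating = c("PG", "R", "R", "R", "R", "PG", "PG-13", "R", "R", "R",
                  "R", "G", "R", "R", "R", "R", "Unrated", "R", "PG-13", "PG-13"),
  genre = c("Action & Adventure", "Mystery & Suspense", "Drama", "Drama",
            "Drama", "Action & Adventure", "Comedy", "Comedy",
            "Action & Adventure", "Drama", "Drama", "Drama", "Comedy",
            "Drama", "Mystery & Suspense", "Musical & Performing Arts",
            "Drama", "Drama", "Drama", "Drama")
)

# Two-way count table: one row per genre, one column per rating.
crosstab <- tabyl(movie_help, genre, mpaa_rating)

# Add a "Total" row at the bottom and a "Total" column on the right.
totals <- adorn_totals(crosstab, where = c("row", "col"))
print(totals)
```

Empty cells come out as 0 rather than NA here, so no fill argument is needed.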

Upvotes: 4

bouncyball

Reputation: 10781

We can use table and chisq.test to perform the test you want:

chisq.test(table(movie_help))
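
If you also want to see the column totals next to that test, here is a minimal base R sketch. It rebuilds the 20 rows from the question without tidyverse, and the names tab and col_totals are only illustrative. With counts this small, chisq.test will likely warn that the chi-squared approximation may be incorrect.

```r
# The same 20 rows as movie_help in the question.
movie_help <- data.frame(
  mpaa_rating = c("PG", "R", "R", "R", "R", "PG", "PG-13", "R", "R", "R",
                  "R", "G", "R", "R", "R", "R", "Unrated", "R", "PG-13", "PG-13"),
  genre = c("Action & Adventure", "Mystery & Suspense", "Drama", "Drama",
            "Drama", "Action & Adventure", "Comedy", "Comedy",
            "Action & Adventure", "Drama", "Drama", "Drama", "Comedy",
            "Drama", "Mystery & Suspense", "Musical & Performing Arts",
            "Drama", "Drama", "Drama", "Drama")
)

# Contingency table with genres as rows and ratings as columns.
tab <- table(movie_help$genre, movie_help$mpaa_rating)

# colSums works on a table or matrix, not on a column name inside a pipe.
col_totals <- colSums(tab)
print(col_totals)

# addmargins appends a "Sum" row and a "Sum" column for display.
print(addmargins(tab))

# Run the test on the table without margins.
print(chisq.test(tab))
```

The test should be run on the table without the margins, since the totals are not observations.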

We can also manually calculate the totals:

dat <- movie_help %>%
  filter(!is.na(genre),!is.na(mpaa_rating)) %>%
  count(genre, mpaa_rating) %>%
  group_by(genre) %>%
  mutate(prop = n) %>%
  mutate(Total = sum(n)) %>%
  select(-n) %>%
  spread(key = mpaa_rating, value = prop) 

# tibble() replaces the deprecated data_frame(); TRUE is safer than T
bind_rows(dat, 
          cbind(tibble(genre = 'Total'), summarise_all(dat[,-1], sum, na.rm = TRUE)))

  genre                     Total     G    PG `PG-13`     R Unrated
  <chr>                     <int> <int> <int>   <int> <int>   <int>
1 Action & Adventure            3    NA     2      NA     1      NA
2 Comedy                        3    NA    NA       1     2      NA
3 Drama                        11     1    NA       2     7       1
4 Musical & Performing Arts     1    NA    NA      NA     1      NA
5 Mystery & Suspense            2    NA    NA      NA     2      NA
6 Total                        20     1     2       3    13       1

Upvotes: 0
