Rose

Reputation: 137

Sum variable by group then run function

I have a data frame that I want to run some statistical tests on. However, I want to group the data based on one of the columns first.

Here's an example data frame:

CATEGORY   ITEM     SHOP1 STOCK   SHOP2 STOCK
 Fruit    Orange         5             9
 Fruit    Apple         12            32
 Fruit     Pear         17             6
  Veg    Carrots        59            72
  Veg    Potatoes        6            57
  Veg   Courgette       43            22
  Veg    Parsnips        5             9
  ...      ...         ...           ...

So for this example, I want to run a chi-squared test, but across categories rather than individual items, so I want to reduce the data to a table like this:

          SHOP1 SHOP2
   FRUIT    34    47
     VEG   113   160

Where the table shows the sum of the stock for each category for each shop (this is a very simplified version - the data that I have runs to 37 categories over a few hundred rows), and no longer specifies the item, just the category.

So I thought I could group_by(CATEGORY) and then run the chi squared test on the grouped data, but that doesn't seem to work. I think I need to add up the two columns with numbers in, but I don't know how to do that in conjunction with the chi squared testing. I've been faffing with this for some time now with no luck, so I'd really appreciate your help!

Upvotes: 0

Views: 85

Answers (2)

eipi10

Reputation: 93871

We can use dplyr to summarise the data and the tidy function from the broom package to return the results of chisq.test in a data frame:

library(broom)
library(dplyr)

df %>% group_by(CATEGORY) %>%
  summarise_at(vars(matches("SHOP")), sum) %>%
  do(tidy(chisq.test(.[, grep("SHOP", names(.))])))

     statistic p.value parameter                                                       method
1 2.566931e-30       1         1 Pearson's Chi-squared test with Yates' continuity correction
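
For anyone wanting to reproduce this with the sample rows from the question, here is a minimal sketch. It assumes the stock columns are named SHOP1 and SHOP2 (the question's headers read "SHOP1 STOCK" and "SHOP2 STOCK"), and it uses across(), which dplyr 1.0.0 and later recommend in place of summarise_at(). The data frame holds only the seven rows shown in the question, so it is illustrative rather than the asker's real data.

```r
library(dplyr)
library(broom)

# Assumption: only the seven example rows from the question,
# with the stock columns renamed to SHOP1 and SHOP2.
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes",
               "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# Sum every column whose name contains "SHOP" within each category.
totals <- df %>%
  group_by(CATEGORY) %>%
  summarise(across(matches("SHOP"), sum))
# Fruit: SHOP1 = 34,  SHOP2 = 47
# Veg:   SHOP1 = 113, SHOP2 = 160

# Drop the label column so only counts reach chisq.test(),
# then tidy() the result into a one-row data frame.
res <- totals %>%
  select(-CATEGORY) %>%
  chisq.test() %>%
  tidy()

res
```

With just two categories this is a 2 x 2 table, which is why chisq.test() applies Yates' continuity correction by default. With more categories, as in the real data, the correction is not applied.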

Upvotes: 1

Charlotte Siska

Reputation: 11

In future, it would help to include the code that wasn't working along with its output. From what I understand, you are trying to create that summary table from the data frame. Is that correct?

This has already been answered pretty well by a previous post: How to sum a variable by group?

From that post, it seems the answer would be:

df %>% group_by(CATEGORY) %>% summarise(SHOP1 = sum(SHOP1), SHOP2 = sum(SHOP2))
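
To get from that summary to the test itself, one option is to turn the summed columns into a numeric matrix first, leaving the CATEGORY label column out of what is passed to chisq.test(). A rough sketch, assuming the stock columns are called SHOP1 and SHOP2 and using only the seven example rows from the question:

```r
library(dplyr)

# Assumption: only the seven example rows from the question,
# with the stock columns renamed to SHOP1 and SHOP2.
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes",
               "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

totals <- df %>%
  group_by(CATEGORY) %>%
  summarise(SHOP1 = sum(SHOP1), SHOP2 = sum(SHOP2))

# Keep only the count columns and carry the categories over as row names.
counts <- as.matrix(totals[, c("SHOP1", "SHOP2")])
rownames(counts) <- totals$CATEGORY

# Chi-squared test of independence between category and shop.
result <- chisq.test(counts)
result$statistic
result$p.value
```

Keeping the categories as row names means the matrix still shows which row is which when printed, without the labels getting in the way of the test.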

Upvotes: 1
