Reputation: 137
I have a data frame that I want to run some statistical tests on. However, I want to group the data based on one of the columns first.
Here's an example data frame:
CATEGORY  ITEM       SHOP1 STOCK  SHOP2 STOCK
Fruit     Orange     5            9
Fruit     Apple      12           32
Fruit     Pear       17           6
Veg       Carrots    59           72
Veg       Potatoes   6            57
Veg       Courgette  43           22
Veg       Parsnips   5            9
...       ...        ...          ...
So for this example, I want to run a chi-squared test across categories rather than individual items, which means reducing the data to a table like this:
       SHOP1  SHOP2
FRUIT  34     47
VEG    113    160
The table shows the total stock for each category in each shop and no longer specifies the item, just the category. (This is a very simplified version: the data that I have runs to 37 categories over a few hundred rows.)
So I thought I could group_by(CATEGORY) and then run the chi squared test on the grouped data, but that doesn't seem to work. I think I need to add up the two columns with numbers in, but I don't know how to do that in conjunction with the chi squared testing. I've been faffing with this for some time now with no luck, so I'd really appreciate your help!
Upvotes: 0
Views: 85
Reputation: 93871
We can use dplyr to summarise the data and the tidy function from the broom package to return the results of chisq.test in a data frame:
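library(broom)
library(dplyr)

# Sum the SHOP columns within each category, then test the resulting table
df %>%
  group_by(CATEGORY) %>%
  summarise_at(vars(matches("SHOP")), sum) %>%
  # Pass only the SHOP columns to chisq.test() and tidy the result into a data frame
  do(tidy(chisq.test(.[, grep("SHOP", names(.))])))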
     statistic p.value parameter                                                       method
1 2.566931e-30       1         1 Pearson's Chi-squared test with Yates' continuity correction
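For anyone who wants to try this on the example rows from the question, here is a minimal sketch. It assumes the stock columns are named SHOP1 and SHOP2 (the question's header shows "SHOP1 STOCK" and "SHOP2 STOCK", so adjust to your real names), and it uses across(), which dplyr 1.0.0 and later offer as an alternative to summarise_at. The object names totals and result are just illustrative.

```r
library(dplyr)

# Example rows from the question; the column names SHOP1 and SHOP2 are an assumption
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes", "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# Sum every column whose name starts with "SHOP" within each category
totals <- df %>%
  group_by(CATEGORY) %>%
  summarise(across(starts_with("SHOP"), sum))

# chisq.test() wants only the counts, so drop CATEGORY and convert to a matrix
result <- chisq.test(as.matrix(select(totals, -CATEGORY)))
result
```

Keeping the summed table in its own object (totals) makes it easy to check the category sums against the table in the question before running the test.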
Upvotes: 1
Reputation: 11
In the future, it would be helpful if you wrote the code that wasn't working and its output. From what I understand, you are trying to create that table based on the data frame. Is that correct?
This has already been answered pretty well by a previous post: How to sum a variable by group?
From that post, it seems the answer would be:
df %>% group_by(CATEGORY) %>% summarise(SHOP1 = sum(SHOP1), SHOP2 = sum(SHOP2))
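If you would rather not depend on dplyr, the same sums can be sketched in base R with rowsum(), which also keeps the categories as row names, much like the table in the question. This is only a sketch on the example rows, again assuming the columns are called SHOP1 and SHOP2; tab and result are illustrative names.

```r
# Example rows from the question; the column names SHOP1 and SHOP2 are an assumption
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes", "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# rowsum() adds up the rows of each column within each group
tab <- rowsum(df[, c("SHOP1", "SHOP2")], group = df$CATEGORY)

# Convert to a matrix so chisq.test() treats it as a contingency table
result <- chisq.test(as.matrix(tab))
result
```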
Upvotes: 1