Is it statistically valid to run an A/B test where users are split into groups A and B, but the test statistic is aggregated per item?
Let's narrow the issue down to a specific example:
Example data:
Addon click-through target: 5%
Group | Count of items that reached target | Count of items that didn't reach target |
---|---|---|
group A | 5216 | 1295 |
group B | 5558 | 953 |
The p-value from Fisher's exact test is less than 0.0001, so the result is statistically significant at alpha = 0.05.
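For anyone who wants to check the number, here is a minimal sketch of a two-sided Fisher's exact test on the table above, using only the Python standard library. The function names `log_comb` and `fisher_exact_two_sided` are my own for illustration, and the two-sided rule used (sum the probabilities of all tables no more likely than the observed one) is an assumption about which variant was run.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution. The p-value sums the probabilities of every table that
    is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_comb(row1 + row2, col1)

    def log_pmf(x):
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_denom

    log_p_obs = log_pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for x in range(lo, hi + 1):
        lp = log_pmf(x)
        # Small tolerance so float noise does not drop ties with the observed table.
        if lp <= log_p_obs + 1e-7:
            total += math.exp(lp)
    return min(1.0, total)


# Rows: group A, group B. Columns: reached target, didn't reach target.
p_value = fisher_exact_two_sided(5216, 1295, 5558, 953)
print(f"two-sided p-value: {p_value:.3g}")
```

Note that both rows sum to 6511, i.e. the same items are counted once in each group.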
My concern is that this methodology (groups split by user, results aggregated by item) violates some assumptions of A/B test design and theory. We ran 500 A/A simulations using Fisher's exact test with alpha = 0.05, and only 1.2% of them (6 of 500) came out statistically significant, which is well below the nominal 5%.
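To make the A/A check concrete, here is a rough sketch of the kind of simulation I mean. This is not our production code: the click model (a per-item base rate multiplied by a per-user propensity), every parameter value, and all function names are hypothetical and chosen only for illustration. Users are randomised to A or B with no treatment effect, clicks are aggregated per item within each group, and the share of runs that Fisher's exact test flags as significant is reported.

```python
import math
import random


def log_comb(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_comb(row1 + row2, col1)

    def log_pmf(x):
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_denom

    log_p_obs = log_pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        math.exp(log_pmf(x))
        for x in range(lo, hi + 1)
        if log_pmf(x) <= log_p_obs + 1e-7
    )
    return min(1.0, total)


def simulate_aa_table(rng, n_users=1000, n_items=100, views_per_user=10, target=0.05):
    """One A/A run: split by user, aggregate by item, return the 2x2 counts."""
    # Assumed model: each item has its own base click-through rate.
    item_ctr = [rng.uniform(0.02, 0.10) for _ in range(n_items)]
    views = {"A": [0] * n_items, "B": [0] * n_items}
    clicks = {"A": [0] * n_items, "B": [0] * n_items}
    for _ in range(n_users):
        group = rng.choice("AB")  # the split is per user
        propensity = rng.lognormvariate(0.0, 0.5)  # assumed user-level heterogeneity
        for item in rng.sample(range(n_items), views_per_user):
            views[group][item] += 1
            if rng.random() < min(1.0, item_ctr[item] * propensity):
                clicks[group][item] += 1
    table = []
    for group in "AB":
        # Items with no views in a group are counted as not reaching the target.
        reached = sum(
            1
            for i in range(n_items)
            if views[group][i] > 0 and clicks[group][i] / views[group][i] >= target
        )
        table.append((reached, n_items - reached))
    return table


def aa_false_positive_rate(n_sims=100, alpha=0.05, seed=0):
    """Share of A/A runs in which Fisher's exact test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        (a, b), (c, d) = simulate_aa_table(rng)
        if fisher_exact_two_sided(a, b, c, d) < alpha:
            rejections += 1
    return rejections / n_sims


rate = aa_false_positive_rate()
print(f"share of significant A/A runs: {rate:.3f}")
```

In this sketch the same items appear in both rows of every table, just as in the real data, so the two rows are not independent samples of items.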
I tried looking online for articles that used this methodology, but I couldn't find relevant sources among the flood of "AB test tutorials" (maybe my search skills suck). I asked GenAI and the model doesn't seem to have a problem with such an approach but... it's GenAI.
Can anyone elaborate on this? Any relevant sources or links?