Reputation: 10160
Let's say I have a dataframe that looks like this:
groups <- floor(runif(1000, min=1, max=5))
activity <- rep(c("A1", "A2", "A3", "A4"), times= 250)
endorsement <- floor(runif(1000, min=0, max=2))
value1 <- runif(1000, min=1, max=10)
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
df <- data.frame(groups, activity, endorsement, value1, area)
printed:
> head(df)
groups activity endorsement value1 area
1 1 A1 0 7.443375 A
2 1 A2 0 4.342376 A
3 1 A3 0 4.810690 A
4 4 A4 0 3.494974 A
5 3 A1 1 6.442354 B
6 1 A2 0 9.794138 C
I want to calculate some descriptive statistics and create some bar charts, but if you look at the area variable, A is very well represented, whereas B and E are not.
I'm not interested in the area variable itself, but the stats/plots will be driven by areas that have high representation in the dataset, so I need to weight the data. I'm not sure of the correct way to do it in the following situations:
Mean and SD
I'm calculating the mean and SD of value1 as follows:
df %>% group_by(groups) %>% summarise(mean=mean(value1), sd=sd(value1))
What's the correct way to calculate a weighted mean/SD to compensate for differences in sample size for each area (i.e. I want to give each area equal weight)?
Stacked bar chart
ggplot(df, aes(groups)) +
geom_bar(aes(fill = activity), position = position_fill(reverse = F))
The bars represent the proportions for how often each activity occurred in each group. Again, this is driven mostly by respondents from area A. Is there a way to balance this and calculate proportions as if each area had equal representation?
Grouped means
ggplot(aes(x = activity, y = value1, fill=factor(groups)), data=df) +
geom_bar(position="dodge", stat="summary", fun="mean")+
guides(fill = guide_legend(reverse=F, title="group"))
The bars represent the average of value1 for each group and activity combination. Again, these averages are weighted in favour of area A, and representation is not equal.
Grouped count proportions
summary_df <- df %>% group_by(groups, activity) %>%
summarise(n=n(), count=sum(endorsement)) %>% mutate(prop=(count/n)*100)
ggplot(aes(x = activity, y = prop, fill = factor(groups)), data=summary_df) +
geom_bar(width=0.8, position = position_dodge(width=0.8), stat="identity") +
guides(fill = guide_legend(reverse=F, title="group"))
For each group and activity combination, I'm counting the number of people that endorsed the item (responded 1), and calculating a proportion of total people in the subgroup.
The 4 problems above all stem from the same issue: each needs to be weighted by area to create equal representation. However, the visualizations are all created differently and show different things (means, stacked bars, grouped means, count proportions), and I'm not sure of the correct way to account for sample size differences in each case. Is there a single fix that will propagate to each of the graph examples?
Upvotes: 1
Views: 2688
Reputation: 24198
One strategy would be to down- or up-sample your dataframe so that each area has the same number of observations. We can use the convenience functions downSample() or upSample() from the caret package, which according to the documentation:
"Simple random sampling is used to down-sample for the majority class(es). Note that the minority class data are left intact..."
To illustrate:
library(dplyr)
library(caret)
# Before
df %>% group_by(area) %>% summarise(n())
# area `n()`
#1 A 400
#2 B 100
#3 C 200
#4 D 200
#5 E 100
# After
set.seed(123)
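# downSample() expects a factor as the class variable; area is a character column in R >= 4.0
test_down <- downSample(df, factor(df$area))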
test_down %>% group_by(area) %>% summarise(n())
# area `n()`
#1 A 100
#2 B 100
#3 C 100
#4 D 100
#5 E 100
# upSample() also expects a factor as the class variable
test_up <- upSample(df, factor(df$area))
test_up %>% group_by(area) %>% summarise(n())
# area `n()`
#1 A 400
#2 B 400
#3 C 400
#4 D 400
#5 E 400
So then your first graph becomes:
library(ggplot2)
ggplot(test_down, aes(groups)) +
geom_bar(aes(fill = activity),
position = position_fill(reverse = F))
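The other summaries from the question can be computed on the balanced dataframe in the same way. The following is a minimal sketch, assuming df is built as in the question (the first seed is added here only for reproducibility) and that dplyr and caret are installed; the names balanced_stats and balanced_props are illustrative and not part of the original post.

```r
library(dplyr)
library(caret)

# Rebuild the example data from the question (seed added for reproducibility)
set.seed(42)
df <- data.frame(
  groups      = floor(runif(1000, min = 1, max = 5)),
  activity    = rep(c("A1", "A2", "A3", "A4"), times = 250),
  endorsement = floor(runif(1000, min = 0, max = 2)),
  value1      = runif(1000, min = 1, max = 10),
  area        = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Balance the areas; downSample() expects a factor as the class variable
set.seed(123)
test_down <- downSample(df, factor(df$area))

# Mean and SD of value1 per group, with every area contributing 100 rows
balanced_stats <- test_down %>%
  group_by(groups) %>%
  summarise(mean = mean(value1), sd = sd(value1))

# Endorsement percentage per group/activity combination
balanced_props <- test_down %>%
  group_by(groups, activity) %>%
  summarise(n = n(), count = sum(endorsement), .groups = "drop") %>%
  mutate(prop = (count / n) * 100)
```

The plotting code from the question should then work unchanged with data = test_down (grouped means) or data = balanced_props (count proportions).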
Note that because we use random sampling, we have no control over which observations get omitted when using downSample(). Hence, the results might look slightly different at each run without set.seed().
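To make the random omission concrete, here is a rough base-R sketch of the idea behind down-sampling. It is not caret's actual implementation; down_sample_manual and the toy dataframe are hypothetical names used only for illustration.

```r
# Hypothetical helper, not part of caret: keep n_min randomly chosen rows
# per class, where n_min is the size of the smallest class
down_sample_manual <- function(data, class_col) {
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  n_min <- min(lengths(rows_by_class))
  keep <- unlist(
    lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]),
    use.names = FALSE
  )
  data[sort(keep), , drop = FALSE]
}

# Toy data with the same area pattern as one block of the question's data
toy <- data.frame(
  id   = 1:10,
  area = c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E")
)

# Same seed before each call, so the same rows are kept both times
set.seed(123)
run1 <- down_sample_manual(toy, "area")
set.seed(123)
run2 <- down_sample_manual(toy, "area")

table(run1$area)        # one row per area
identical(run1, run2)   # TRUE
```

Without resetting the seed, the rows retained for A, C and D can differ between calls, while B and E (the minority classes) are always kept in full.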
Upvotes: 1