Reputation: 1456
I'm trying to run some analysis and running into a roadblock (more like a mental block)...
I have 3 different factor variables:
Analyst
, Associate
, Manager
, Sr. Manger
, Director
, ED
, VP
Male
, Female
Mid-Year
, Year-End
, Beyond
I want to check to see if there is any difference in Gender
across Cohort
and Timeframe
. I.e., are female analysts more likely to fall into Timeframe = "Beyond"
than their Male counterparts.
My initial thought is to do something like this:
library(dplyr)
x <- df %>%
filter(Gender %in% c("Male","Female")) %>%
filter(!is.na("Timeframe")) %>%
group_by(Timeframe, Cohort, Gender) %>%
summarise(n = n()) %>%
mutate(freq = 100 * (n / sum(n)))
But this is giving me percents that don't quite make sense. Ideally I'd like to conclude: "In the Analyst cohort, there is or is not a big difference in the timeframe Year-end or Mid-year or Beyond for gender"
dput(head(df1,30))
structure(list(V1 = c("Female", "Male", "Male", "Male", "Male",
"Female", "Male", "Female", "Male", "Female", "Male", "Female",
"Male", "Female", "Female", "Female", "Male", "Female", "Female",
"Male", "Female", "Female", "Male", "Male", "Female", "Female",
"Male", "Male", "Female", "Female"), V2 = c("Executive Director",
"Executive", "Vice President", "Manager", "Director", "Executive Director",
"Manager", "Senior Manager", "Senior Manager", "Vice President",
"Director", "Senior Manager", "Manager", "Senior Manager", "Senior Manager",
"Senior Manager", "Executive Director", "Senior Manager", "Manager",
"Director", "Senior Manager", "Associate", "Vice President",
"Senior Manager", "Executive Director", "Manager", "Executive Director",
"Director", "Associate", "Senior Manager"), V3 = c("Beyond",
"Beyond", "Beyond", "Beyond", "Beyond", "Mid-Year Promotion",
"Beyond", "Year End Promotion", "Beyond", "Year End Promotion",
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion",
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion",
"Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond",
"Beyond", "Beyond", "Beyond")), row.names = c("1", "2", "4",
"5", "6", "7", "8", "10", "11", "12", "13", "14", "15", "16",
"17", "19", "21", "22", "23", "24", "25", "27", "28", "29", "30",
"31", "32", "33", "34", "35"), class = "data.frame")
Upvotes: 0
Views: 202
Reputation: 4989
I'm really a fan of 1 picture == 1000 words
, so here are two methods to see what is possible in R visually.
This method uses cumulated percentages and cumulated sums with the gganimate
and ggplot2
packages. You can play with the parameters (e.g., nframes
) to tweak it to your liking.
g <- ggplot(dfcount, aes(x = gender, y = c, fill = timeframe)) +
geom_col(position = "identity") +
labs(title = "Gender and Promotion at Goliath National Bank",
subtitle = "Career level: {closest_state}",
x = "Gender",
y = "Number of employees",
fill = "Time of promotion") +
geom_label(aes(y = c, label = text)) +
scale_fill_manual(values = c("#ABE188", "#F7EF99", "#F1BB87"),
guide = guide_legend(reverse = TRUE)) +
transition_states(cohort, transition_length = 1, state_length = 3)
animate(g, nframes = 300)
set.seed(1701)
g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
"Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")
df <- data.frame(
gender = factor(sample(g, 1000, c(0.39, 0.61),
replace = TRUE), levels = g),
cohort = factor(sample(c, 1000, c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01),
replace = TRUE), levels = c),
timeframe = factor(sample(t, 1000, c(0.05, 0.35, 0.6),
replace = TRUE), levels = t))
library(dplyr)
library(ggplot2)
library(gganimate)
dfcount <- df %>%
group_by(gender, cohort, timeframe) %>%
summarize(n = n()) %>%
mutate(cum = cumsum(n)) %>%
mutate(perc = n / sum(n)) %>%
mutate(cumperc = cumsum(perc)) %>%
mutate(text = paste(round(perc*100, 1), "%"))
dfcount <- dfcount[order(dfcount$cohort, dfcount$gender, desc(dfcount$c)), ]
so that
> head(dfcount)
# A tibble: 6 x 8
# Groups: gender, cohort [2]
gender cohort timeframe n c perc cperc text
<fct> <fct> <fct> <int> <int> <dbl> <dbl> <chr>
1 Female Analyst Beyond 73 126 0.579 1 57.9 %
2 Female Analyst Year-End 48 53 0.381 0.421 38.1 %
3 Female Analyst Mid-Year 5 5 0.0397 0.0397 4 %
4 Male Analyst Beyond 95 172 0.552 1 55.2 %
5 Male Analyst Year-End 70 77 0.407 0.448 40.7 %
6 Male Analyst Mid-Year 7 7 0.0407 0.0407 4.1 %
It can also be as simple as:
plot(table(df$gender, df$timeframe),
main = "Gender vs. Timeframe",
sub = paste("A comparison of the careers of",
count(subset(df, gender == "Female")), "women and",
count(subset(df, gender == "Male")), "men"),
ylab = "Time of promotion")
Everything after the first line is optional. Obviously you can make this plot much prettier using ggplot2
, waffle
, or similar.
set.seed(1701)
g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
"Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")
df <- data.frame(
gender = factor(sample(g, 1000, c(0.39, 0.61),
replace = TRUE), levels = g),
cohort = factor(sample(c, 1000, c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01),
replace = TRUE), levels = c),
timeframe = factor(sample(t, 1000, c(0.05, 0.35, 0.6),
replace = TRUE), levels = t))
so that
> head(df)
gender cohort timeframe
1 Male Associate Year-End
2 Female Analyst Year-End
3 Male Manager Beyond
4 Male Associate Beyond
5 Female Associate Year-End
6 Male Manager Beyond
Upvotes: 1
Reputation: 317
Maybe you can just inspect the frequency matrix like:
table(df1[df1$V1=="Male",2:3])
table(df1[df1$V1=="Female",2:3])
This gives you an first impression on how your data is distributed. For further investigation you may specify your Null hypotheses a bit more precisely in order to setup the right test. Have a look at at eg. the pearson Chi square test like:
cntTable <- table(df1[,c(1,3)])
chisq.test(cntTable)
Upvotes: 0
Reputation: 33
EJJ is right in his comment - you need to ungroup after the summarise function. Otherwise, you'll be computing group-wise percentages, rather than percentages of the whole.
x=df %>% filter(Gender %in% c('Male',"Female")) %>%
filter(!is.na(`Promotion Timeframe`)) %>%
group_by(`Promotion Timeframe`,Management_Level,Gender) %>%
dplyr::summarise(n=n()) %>%
ungroup() %>%
mutate(freq = 100* (n/sum(n)))
Upvotes: 1