Reputation: 1456

Best way to analyze correlation between 3 different categorical variables

I'm trying to run some analysis and running into a roadblock (more like a mental block)...

Goal

I have 3 different factor variables:

Cohort: Analyst, Associate, Manager, Sr. Manger, Director, ED, VP
Gender: Male, Female
Timeframe: Mid-Year, Year-End, Beyond

I want to check to see if there is any difference in Gender across Cohort and Timeframe. I.e., are female analysts more likely to fall into Timeframe = "Beyond" than their Male counterparts.

Code

My initial thought is to do something like this:

library(dplyr)
x <- df %>% 
    filter(Gender %in% c("Male","Female")) %>% 
    filter(!is.na("Timeframe")) %>% 
    group_by(Timeframe, Cohort, Gender) %>% 
    summarise(n = n()) %>% 
    mutate(freq = 100 * (n / sum(n)))

But this is giving me percents that don't quite make sense. Ideally I'd like to conclude: "In the Analyst cohort, there is or is not a big difference in the timeframe Year-end or Mid-year or Beyond for gender"

Data

dput(head(df1,30))
structure(list(V1 = c("Female", "Male", "Male", "Male", "Male", 
"Female", "Male", "Female", "Male", "Female", "Male", "Female", 
"Male", "Female", "Female", "Female", "Male", "Female", "Female", 
"Male", "Female", "Female", "Male", "Male", "Female", "Female", 
"Male", "Male", "Female", "Female"), V2 = c("Executive Director", 
"Executive", "Vice President", "Manager", "Director", "Executive Director", 
"Manager", "Senior Manager", "Senior Manager", "Vice President", 
"Director", "Senior Manager", "Manager", "Senior Manager", "Senior Manager", 
"Senior Manager", "Executive Director", "Senior Manager", "Manager", 
"Director", "Senior Manager", "Associate", "Vice President", 
"Senior Manager", "Executive Director", "Manager", "Executive Director", 
"Director", "Associate", "Senior Manager"), V3 = c("Beyond", 
"Beyond", "Beyond", "Beyond", "Beyond", "Mid-Year Promotion", 
"Beyond", "Year End Promotion", "Beyond", "Year End Promotion", 
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion", 
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion", 
"Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond", 
"Beyond", "Beyond", "Beyond")), row.names = c("1", "2", "4", 
"5", "6", "7", "8", "10", "11", "12", "13", "14", "15", "16", 
"17", "19", "21", "22", "23", "24", "25", "27", "28", "29", "30", 
"31", "32", "33", "34", "35"), class = "data.frame")

Upvotes: 0

Answers (3)

Roman

Reputation: 4989

I'm really a fan of 1 picture == 1000 words, so here are two methods to see what is possible in R visually.

1. Advanced method

This method uses cumulated percentages and cumulated sums with the gganimate and ggplot2 packages. You can play with the parameters (e.g., nframes) to tweak it to your liking.

Code

g <- ggplot(dfcount, aes(x = gender, y = c, fill = timeframe)) +
    geom_col(position = "identity") +
    labs(title = "Gender and Promotion at Goliath National Bank",
         subtitle = "Career level: {closest_state}", 
         x = "Gender",
         y = "Number of employees",
         fill = "Time of promotion") +
    geom_label(aes(y = c, label = text)) +
    scale_fill_manual(values = c("#ABE188", "#F7EF99", "#F1BB87"), 
                      guide = guide_legend(reverse = TRUE)) + 
    transition_states(cohort, transition_length = 1, state_length = 3)
animate(g, nframes = 300)

Data

set.seed(1701)

g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
    "Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")

df <- data.frame(
    gender = factor(sample(g, 1000, c(0.39, 0.61),
        replace = TRUE), levels = g), 
    cohort = factor(sample(c, 1000, c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01), 
        replace = TRUE), levels = c),
    timeframe = factor(sample(t, 1000, c(0.05, 0.35, 0.6), 
        replace = TRUE), levels = t))

library(dplyr)
library(ggplot2)
library(gganimate)
dfcount <- df %>% 
    group_by(gender, cohort, timeframe) %>%           
    summarize(n = n()) %>% 
    mutate(cum = cumsum(n)) %>%
    mutate(perc = n / sum(n)) %>%
    mutate(cumperc = cumsum(perc)) %>%
    mutate(text = paste(round(perc*100, 1), "%"))

dfcount <- dfcount[order(dfcount$cohort, dfcount$gender, desc(dfcount$c)), ]

so that

> head(dfcount)
# A tibble: 6 x 8
# Groups:   gender, cohort [2]
  gender cohort  timeframe     n     c   perc  cperc text  
  <fct>  <fct>   <fct>     <int> <int>  <dbl>  <dbl> <chr> 
1 Female Analyst Beyond       73   126 0.579  1      57.9 %
2 Female Analyst Year-End     48    53 0.381  0.421  38.1 %
3 Female Analyst Mid-Year      5     5 0.0397 0.0397 4 %   
4 Male   Analyst Beyond       95   172 0.552  1      55.2 %
5 Male   Analyst Year-End     70    77 0.407  0.448  40.7 %
6 Male   Analyst Mid-Year      7     7 0.0407 0.0407 4.1 %

2. Simple approach

It can also be as simple as:

Code

plot(table(df$gender, df$timeframe), 
     main = "Gender vs. Timeframe",
     sub = paste("A comparison of the careers of",
         count(subset(df, gender == "Female")), "women and",
         count(subset(df, gender == "Male")), "men"), 
     ylab = "Time of promotion")

Everything after the first line is optional. Obviously you can make this plot much prettier using ggplot2, waffle, or similar.

Data

set.seed(1701)

g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
    "Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")

df <- data.frame(
    gender = factor(sample(g, 1000, c(0.39, 0.61),
        replace = TRUE), levels = g), 
    cohort = factor(sample(c, 1000, c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01), 
        replace = TRUE), levels = c),
    timeframe = factor(sample(t, 1000, c(0.05, 0.35, 0.6), 
        replace = TRUE), levels = t))

so that

> head(df)
  gender    cohort timeframe
1   Male Associate  Year-End
2 Female   Analyst  Year-End
3   Male   Manager    Beyond
4   Male Associate    Beyond
5 Female Associate  Year-End
6   Male   Manager    Beyond

Upvotes: 1

HolgerBarlt

Reputation: 317

Maybe you can just inspect the frequency matrix like:

 table(df1[df1$V1=="Male",2:3])
 table(df1[df1$V1=="Female",2:3])

This gives you an first impression on how your data is distributed. For further investigation you may specify your Null hypotheses a bit more precisely in order to setup the right test. Have a look at at eg. the pearson Chi square test like:

cntTable <- table(df1[,c(1,3)])
chisq.test(cntTable)

Upvotes: 0

Mike Mahoney

Reputation: 33

EJJ is right in his comment - you need to ungroup after the summarise function. Otherwise, you'll be computing group-wise percentages, rather than percentages of the whole.

x=df %>% filter(Gender %in% c('Male',"Female")) %>% 
filter(!is.na(`Promotion Timeframe`)) %>% 
group_by(`Promotion Timeframe`,Management_Level,Gender) %>% 
dplyr::summarise(n=n()) %>% 
ungroup() %>%
mutate(freq = 100* (n/sum(n)))

Upvotes: 1

Best way to analyze correlation between 3 different categorical variables

Goal

Code

Data

Answers (3)

1. Advanced method

Code

Data

2. Simple approach

Code

Data

Related Questions