Reputation: 359

Making a plot of categorical data with two predictors

This question is an extension of a previous one I asked, with slightly more complex data. It seems quite basic, but I've been banging my head against the wall for several days over this.

I need to create plots of the percentage of prevalence of the dependent variable (choice) by the independent variables ses (x-axis) and agegroup (perhaps a stacked barplot grouping). Ideally, I'd like the plot to be a side-by-side 2-faceted plot, with one facet per sex.

The relevant part of my data is in this form:

subject   choice       agegroup    sex       ses

John      square       2           Female    A
John      triangle     2           Female    A
John      triangle     2           Female    A
Mary      circle       2           Female    C
Mary      square       2           Female    C
Mary      rectangle    2           Female    C
Mary      square       2           Female    C
Hodor     hodor        5           Male      D
Hodor     hodor        5           Male      D
Hodor     hodor        5           Male      D
Hodor     hodor        5           Male      D
Jill      square       3           Female    B
Jill      circle       3           Female    B
Jill      square       3           Female    B
Jill      hodor        3           Female    B
Jill      triangle     3           Female    B
Jill      rectangle    3           Female    B
... [about 12,000 more observations follow]

I want to use ggplot2 for its power and flexibility, as well as its apparent ease of use. But every tutorial or how-to I've found starts with 90% of the work already done, because it just loads one of the built-in datasets that ship with R or its packages. Of course, I need to use my own data.

I'm aware that ggplot2 needs the data in long format, but I haven't managed to get the conversion right. I've also become even more confused by all the different data manipulation packages out there, and by how some of them seem to be part of others.

EDIT: I'm beginning to realize that plotting this with a line plot, as per my original question, won't work. At least I don't think so now. So here's a mock-up of a possible way of graphing this dataset (with completely fictional values):

[Image: mock-up plot]

Colors represent different responses to choice.

Could someone please lend me a hand with this? And if you have any suggestions for a better way to visualize the data, please share!

Upvotes: 1

Answers (2)

Andrew Lavers

Reputation: 4378

This shows both a point plot and a stacked bar chart for the revised question. Some guidance on thinking about the visualization: do you already know the "story" in your data? If not, you may need to work through many visualizations to discover the story, then build the visualization that best shows it.

df <- read.table(text='subject choice agegroup sex ses                                      
John square 2 Female A                                                                      
John triangle 2 Female A                                                                    
John triangle 2 Female A                                                                    
Mary circle 2 Female C                                                                      
Mary square 2 Female C                                                                      
Mary rectangle 2 Female C                                                                   
Mary square 2 Female C                                                                      
Hodor hodor 5 Male D                                                                        
Hodor hodor 5 Male D                                                                        
Hodor hodor 5 Male D                                                                        
Hodor hodor 5 Male D                                                                        
Jill square 3 Female B                                                                      
Jill circle 3 Female B                                                                      
Jill square 3 Female B                                                                      
Jill hodor 3 Female B                                                                       
Jill triangle 3 Female B                                                                    
Jill rectangle 3 Female B', header=TRUE)                                                    

library(tidyverse)                                                                          
#> ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.4
#> ✔ tidyr   0.8.0     ✔ stringr 1.3.0
#> ✔ readr   1.1.1     ✔ forcats 0.3.0
#> ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

# agegroup is read as numeric - convert to a factor                                         
df$agegroup <- factor(df$agegroup)                                                          

# Create a data frame with one row per subject (check the result for data issues!)
df_subject <- df %>%
  group_by(subject, agegroup, ses, sex) %>%
  summarize()
df_subject
#> # A tibble: 4 x 4
#> # Groups:   subject, agegroup, ses [?]
#>   subject agegroup ses   sex   
#>   <fct>   <fct>    <fct> <fct> 
#> 1 Hodor   5        D     Male  
#> 2 Jill    3        B     Female
#> 3 John    2        A     Female
#> 4 Mary    2        C     Female

# Calculate the proportion of each choice within each subject
df_subject_choice <- df %>%
  # summarize the counts by the finest group to analyze
  group_by(subject, choice) %>%
  summarize(n = n()) %>%
  # calculate proportions based on counts (within each subject)
  mutate(p = prop.table(n))
df_subject_choice
#> # A tibble: 11 x 4
#> # Groups:   subject [4]
#>    subject choice        n     p
#>    <fct>   <fct>     <int> <dbl>
#>  1 Hodor   hodor         4 1.00 
#>  2 Jill    circle        1 0.167
#>  3 Jill    hodor         1 0.167
#>  4 Jill    rectangle     1 0.167
#>  5 Jill    square        2 0.333
#>  6 Jill    triangle      1 0.167
#>  7 John    square        1 0.333
#>  8 John    triangle      2 0.667
#>  9 Mary    circle        1 0.250
#> 10 Mary    rectangle     1 0.250
#> 11 Mary    square        2 0.500

# Put the results together by joining
df_joined <- df_subject_choice %>%
  left_join(df_subject, by = "subject") %>%
  select(subject, ses, sex, agegroup, choice, p)
df_joined
#> # A tibble: 11 x 6
#> # Groups:   subject [4]
#>    subject ses   sex    agegroup choice        p
#>    <fct>   <fct> <fct>  <fct>    <fct>     <dbl>
#>  1 Hodor   D     Male   5        hodor     1.00 
#>  2 Jill    B     Female 3        circle    0.167
#>  3 Jill    B     Female 3        hodor     0.167
#>  4 Jill    B     Female 3        rectangle 0.167
#>  5 Jill    B     Female 3        square    0.333
#>  6 Jill    B     Female 3        triangle  0.167
#>  7 John    A     Female 2        square    0.333
#>  8 John    A     Female 2        triangle  0.667
#>  9 Mary    C     Female 2        circle    0.250
#> 10 Mary    C     Female 2        rectangle 0.250
#> 11 Mary    C     Female 2        square    0.500

# Summarize at whatever level you want to analyze (note that this may be possible directly in ggplot)
df_summary <- df_joined %>%
  group_by(agegroup, ses, sex, choice) %>%
  summarize(p_mean = mean(p))
df_summary
#> # A tibble: 11 x 5
#> # Groups:   agegroup, ses, sex [?]
#>    agegroup ses   sex    choice    p_mean
#>    <fct>    <fct> <fct>  <fct>      <dbl>
#>  1 2        A     Female square     0.333
#>  2 2        A     Female triangle   0.667
#>  3 2        C     Female circle     0.250
#>  4 2        C     Female rectangle  0.250
#>  5 2        C     Female square     0.500
#>  6 3        B     Female circle     0.167
#>  7 3        B     Female hodor      0.167
#>  8 3        B     Female rectangle  0.167
#>  9 3        B     Female square     0.333
#> 10 3        B     Female triangle   0.167
#> 11 5        D     Male   hodor      1.00

# Plot points
ggplot(df_summary, aes(x = ses, y = choice, color = agegroup, size = p_mean)) +
  geom_point() +
  facet_wrap(~sex)

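# Plot faceted 100% stacked bar
# position = "fill" rescales every bar to a total of 1, even when the
# p_mean values within a group do not sum to exactly 1
ggplot(df_summary, aes(x = agegroup, y = p_mean, color = choice, fill = choice)) +
  geom_col(position = "fill") +
  facet_grid(sex ~ ses)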
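
On the comment above that the summary may be possible directly in ggplot: here is a minimal sketch, assuming it is acceptable to weight every observation equally rather than averaging per-subject proportions (so the numbers can differ from p_mean when subjects contribute different numbers of rows). The small data frame is only a stand-in with the same columns as the question's data, and the facet layout (agegroup in rows, sex in columns) is one possible arrangement, not the only one.

```r
library(ggplot2)

# Stand-in data with the same columns as the question (fewer rows)
df <- data.frame(
  subject  = c("John", "John", "John", "Mary", "Mary",
               "Hodor", "Hodor", "Jill", "Jill"),
  choice   = c("square", "triangle", "triangle", "circle", "square",
               "hodor", "hodor", "square", "circle"),
  agegroup = factor(c(2, 2, 2, 2, 2, 5, 5, 3, 3)),
  sex      = c("Female", "Female", "Female", "Female", "Female",
               "Male", "Male", "Female", "Female"),
  ses      = c("A", "A", "A", "C", "C", "D", "D", "B", "B")
)

# geom_bar() counts the raw rows for each ses/choice combination;
# position = "fill" rescales each bar so its segments sum to 1.
p <- ggplot(df, aes(x = ses, fill = choice)) +
  geom_bar(position = "fill") +
  facet_grid(agegroup ~ sex) +
  labs(y = "proportion of responses")
p
```

With this approach no reshaping or pre-aggregation is needed, because the raw one-row-per-response layout is already what geom_bar() expects.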

Upvotes: 0

Wimpel

Reputation: 27732

I'm not sure I understand your desired output correctly, so here's a first attempt.

library( tidyverse )

df2 <- df %>% 
  mutate( agegroup = as.factor( agegroup ) ) %>%
  group_by( ses, agegroup, sex, choice ) %>%
  summarise( count = n() )

#    ses   agegroup sex    choice    count
#    <fct> <fct>    <fct>  <fct>     <int>
#  1 A     2        Female square        1
#  2 A     2        Female triangle      2
#  3 B     3        Female circle        1
#  4 B     3        Female hodor         1
#  5 B     3        Female rectangle     1
#  6 B     3        Female square        2
#  7 B     3        Female triangle      1
#  8 C     2        Female circle        1
#  9 C     2        Female rectangle     1
# 10 C     2        Female square        2
# 11 D     5        Male   hodor         4

ggplot(df2, aes( x = ses, y = count, group=agegroup, colour = agegroup)) +
  geom_point( stat='summary', fun.y=sum) +
  stat_summary(fun.y=sum, geom="line") + 
  facet_grid( c("choice", "sex" ) )
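
Since the question asks for percentages rather than raw counts, the counts can be converted within each ses/agegroup/sex group before plotting. This is only a sketch: the data frame below is a hand-typed stand-in for part of df2 (the A, C and D rows of the output above), and the names df_pct and pct are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Stand-in for df2: counts per ses/agegroup/sex/choice
df2 <- data.frame(
  ses      = c("A", "A", "C", "C", "C", "D"),
  agegroup = factor(c(2, 2, 2, 2, 2, 5)),
  sex      = c("Female", "Female", "Female", "Female", "Female", "Male"),
  choice   = c("square", "triangle", "circle", "rectangle", "square", "hodor"),
  count    = c(1L, 2L, 1L, 1L, 2L, 4L)
)

# Percentage of each choice within its ses/agegroup/sex group
df_pct <- df2 %>%
  group_by(ses, agegroup, sex) %>%
  mutate(pct = 100 * count / sum(count)) %>%
  ungroup()

# Stacked bars of the percentages, one panel per agegroup/sex combination
p_pct <- ggplot(df_pct, aes(x = ses, y = pct, fill = choice)) +
  geom_col() +
  facet_grid(agegroup ~ sex)
p_pct
```

Each bar then stacks to 100 within its panel, which makes groups with very different numbers of observations comparable.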

Upvotes: 1
