Reputation: 359
This question is an extension of a previous one I asked, with slightly more complex data. It seems quite basic, but I've been banging my head against the wall for several days over this.
I need to create plots of the percentage prevalence of the dependent variable (choice) by the independent variables ses (x-axis) and agegroup (perhaps as a stacked barplot grouping). Ideally, I'd like the plot to be a side-by-side 2-faceted plot, with one facet per sex.
The relevant part of my data is in this form:
subject choice agegroup sex ses
John square 2 Female A
John triangle 2 Female A
John triangle 2 Female A
Mary circle 2 Female C
Mary square 2 Female C
Mary rectangle 2 Female C
Mary square 2 Female C
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Jill square 3 Female B
Jill circle 3 Female B
Jill square 3 Female B
Jill hodor 3 Female B
Jill triangle 3 Female B
Jill rectangle 3 Female B
... [about 12,000 more observations follow]
I want to use ggplot2 for its power and flexibility, as well as its apparent ease of use. But every tutorial or how-to I've found starts out with 90% of the work already done, by virtue of the fact that they just load up one of the built-in datasets that are provided by R or its packages. But of course I need to use my own data.
I'm aware of the need to convert it to longform in order for ggplot2 to be able to use it, but I just haven't been able to manage to do it right. And I've become even more confused by all the different data manipulation packages that are out there, and how some seem to be a part of others, or something along those lines.
EDIT: I'm beginning to realize that plotting this with a line plot, as per my original question, won't work. At least I don't think so now. So here's a mock-up of a possible way of graphing this dataset (with completely fictional values):
Colors represent different responses to choice.
Could someone please lend me a hand with this? And if you have any suggestions for a better way to visualize the data, please share!
Upvotes: 1
Views: 449
Reputation: 4378
This shows both a point plot and a stacked bar chart for the revised question. Some guidance on thinking about the visualization: do you already know the "story" in your data? If not, you may need to work through many visualizations to discover the story, then build the visualization that best shows it.
df <- read.table(text='subject choice agegroup sex ses
John square 2 Female A
John triangle 2 Female A
John triangle 2 Female A
Mary circle 2 Female C
Mary square 2 Female C
Mary rectangle 2 Female C
Mary square 2 Female C
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Jill square 3 Female B
Jill circle 3 Female B
Jill square 3 Female B
Jill hodor 3 Female B
Jill triangle 3 Female B
Jill rectangle 3 Female B', header=TRUE)
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.8.0 ✔ stringr 1.3.0
#> ✔ readr 1.1.1 ✔ forcats 0.3.0
#> ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
# agegroup is read as numeric - convert to a factor
df$agegroup <- factor(df$agegroup)
# Create dataframe by subject (check for data issues!!)
df_subject <- df %>%
  group_by(subject, agegroup, ses, sex) %>%
  summarize()
df_subject
#> # A tibble: 4 x 4
#> # Groups: subject, agegroup, ses [?]
#> subject agegroup ses sex
#> <fct> <fct> <fct> <fct>
#> 1 Hodor 5 D Male
#> 2 Jill 3 B Female
#> 3 John 2 A Female
#> 4 Mary 2 C Female
# calculate the proportionate choice by subject
df_subject_choice <- df %>%
  # summarize the counts by the finest group to analyze
  group_by(subject, choice) %>%
  summarize(n = n()) %>%
  # calculate proportions based on counts
  mutate(p = prop.table(n))
df_subject_choice
#> # A tibble: 11 x 4
#> # Groups: subject [4]
#> subject choice n p
#> <fct> <fct> <int> <dbl>
#> 1 Hodor hodor 4 1.00
#> 2 Jill circle 1 0.167
#> 3 Jill hodor 1 0.167
#> 4 Jill rectangle 1 0.167
#> 5 Jill square 2 0.333
#> 6 Jill triangle 1 0.167
#> 7 John square 1 0.333
#> 8 John triangle 2 0.667
#> 9 Mary circle 1 0.250
#> 10 Mary rectangle 1 0.250
#> 11 Mary square 2 0.500
# Put the results together by joining
df_joined <- df_subject_choice %>%
  left_join(df_subject, by = "subject") %>%
  select(subject, ses, sex, agegroup, choice, p)
df_joined
#> # A tibble: 11 x 6
#> # Groups: subject [4]
#> subject ses sex agegroup choice p
#> <fct> <fct> <fct> <fct> <fct> <dbl>
#> 1 Hodor D Male 5 hodor 1.00
#> 2 Jill B Female 3 circle 0.167
#> 3 Jill B Female 3 hodor 0.167
#> 4 Jill B Female 3 rectangle 0.167
#> 5 Jill B Female 3 square 0.333
#> 6 Jill B Female 3 triangle 0.167
#> 7 John A Female 2 square 0.333
#> 8 John A Female 2 triangle 0.667
#> 9 Mary C Female 2 circle 0.250
#> 10 Mary C Female 2 rectangle 0.250
#> 11 Mary C Female 2 square 0.500
# Summarize to whatever level to analyze (Note that this may be possible directly in ggplot)
df_summary <- df_joined %>%
  group_by(agegroup, ses, sex, choice) %>%
  summarize(p_mean = mean(p))
df_summary
#> # A tibble: 11 x 5
#> # Groups: agegroup, ses, sex [?]
#> agegroup ses sex choice p_mean
#> <fct> <fct> <fct> <fct> <dbl>
#> 1 2 A Female square 0.333
#> 2 2 A Female triangle 0.667
#> 3 2 C Female circle 0.250
#> 4 2 C Female rectangle 0.250
#> 5 2 C Female square 0.500
#> 6 3 B Female circle 0.167
#> 7 3 B Female hodor 0.167
#> 8 3 B Female rectangle 0.167
#> 9 3 B Female square 0.333
#> 10 3 B Female triangle 0.167
#> 11 5 D Male hodor 1.00
# Plot points
ggplot(df_summary, aes(x = ses, y = choice, color = agegroup, size = p_mean)) +
  geom_point() +
  facet_wrap(~sex)
# Plot faceted 100% stacked bar
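ggplot(df_summary, aes(x = agegroup, y = p_mean, color = choice, fill = choice)) +
  # position = "fill" rescales each bar to sum to 1, so it stays a 100% bar
  # even when several subjects share an agegroup/ses/sex cell
  geom_col(position = "fill") +
  facet_grid(sex ~ ses)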
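The note above that the summary "may be possible directly in ggplot" can be sketched roughly as follows. This is only an illustration, not the method used above: it pools all observations in a cell instead of averaging per-subject proportions, so the numbers can differ when subjects contribute unequal numbers of rows. The object name p_direct and the choice of agegroup as facet rows are assumptions made for the example.

```r
library(ggplot2)

# Sample rows from the question
df <- read.table(text = "subject choice agegroup sex ses
John square 2 Female A
John triangle 2 Female A
John triangle 2 Female A
Mary circle 2 Female C
Mary square 2 Female C
Mary rectangle 2 Female C
Mary square 2 Female C
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Jill square 3 Female B
Jill circle 3 Female B
Jill square 3 Female B
Jill hodor 3 Female B
Jill triangle 3 Female B
Jill rectangle 3 Female B", header = TRUE)
df$agegroup <- factor(df$agegroup)

# geom_bar() counts rows itself; position = "fill" rescales each bar to 1,
# giving a 100% stacked bar without any manual summarizing
p_direct <- ggplot(df, aes(x = ses, fill = choice)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  facet_grid(agegroup ~ sex) +
  labs(y = "Share of responses")

p_direct
```

With the full 12,000-row dataset, the same call should work unchanged, provided the column names match those in the sample.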
Upvotes: 0
Reputation: 27732
I'm not sure I understand your desired output correctly, so here's a first try:
library( tidyverse )
df2 <- df %>%
  mutate( agegroup = as.factor( agegroup ) ) %>%
  group_by( ses, agegroup, sex, choice ) %>%
  summarise( count = n() )
# ses agegroup sex choice count
# <fct> <fct> <fct> <fct> <int>
# 1 A 2 Female square 1
# 2 A 2 Female triangle 2
# 3 B 3 Female circle 1
# 4 B 3 Female hodor 1
# 5 B 3 Female rectangle 1
# 6 B 3 Female square 2
# 7 B 3 Female triangle 1
# 8 C 2 Female circle 1
# 9 C 2 Female rectangle 1
# 10 C 2 Female square 2
# 11 D 5 Male hodor 4
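# Note: fun.y was renamed to fun in ggplot2 3.3.0; on older versions use fun.y
ggplot( df2, aes( x = ses, y = count, group = agegroup, colour = agegroup ) ) +
  stat_summary( fun = sum, geom = "point" ) +
  stat_summary( fun = sum, geom = "line" ) +
  # rows = choice, columns = sex
  facet_grid( choice ~ sex )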
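Since the question asks for percentages rather than raw counts, the counts above could be converted to within-group shares before plotting. The following is a minimal sketch under assumptions: the names df_pct, pct and p_pct are made up for the example, and "percentage" is taken to mean the share of all responses within each sex/ses/agegroup cell, which may not be the definition the asker intends.

```r
library(dplyr)
library(ggplot2)

# Sample rows from the question
df <- read.table(text = "subject choice agegroup sex ses
John square 2 Female A
John triangle 2 Female A
John triangle 2 Female A
Mary circle 2 Female C
Mary square 2 Female C
Mary rectangle 2 Female C
Mary square 2 Female C
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Hodor hodor 5 Male D
Jill square 3 Female B
Jill circle 3 Female B
Jill square 3 Female B
Jill hodor 3 Female B
Jill triangle 3 Female B
Jill rectangle 3 Female B", header = TRUE)

df_pct <- df %>%
  mutate(agegroup = factor(agegroup)) %>%
  # one row per cell/choice combination, with its row count in column n
  count(sex, ses, agegroup, choice) %>%
  # percentages are taken within each sex/ses/agegroup cell
  group_by(sex, ses, agegroup) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()

# Stacked bars: ses on the x-axis, one facet column per sex
p_pct <- ggplot(df_pct, aes(x = ses, y = pct, fill = choice)) +
  geom_col() +
  facet_grid(agegroup ~ sex) +
  labs(y = "Share of responses (%)")

p_pct
```

Precomputing pct keeps the numbers available for inspection or labelling, at the cost of one extra step compared with letting ggplot2 do the counting.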
Upvotes: 1