Reputation: 1439
I am a Python user trying to pick up R to become more versatile. I decided to work through the book R for Data Science, and I have run into a challenge in chapter 15, which covers factors in R. R seems to handle categorical variables quite a bit differently than Python does. Using the gss_cat data (which ships with the forcats package), I was able to make some line plots to visualize the proportions of Democrats, Republicans, and Independents and their change over time (thanks to this helpful resource).
library(tidyverse)  # loads dplyr, ggplot2, and forcats (which provides gss_cat)

gss_cat %>%
  # Collapse the ten partyid levels into four broader groups
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  ggplot(mapping = aes(x = year, y = n, color = fct_reorder2(partyid, year, n))) +
  geom_point() +
  geom_line() +
  ggtitle("Proportions of Democrat, Republican, and Independent Over Time") +
  labs(color = 'Party', x = 'Year', y = 'Count') +
  # theme_minimal() is a complete theme, so it must come before the theme() tweak
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
However, one of the next questions is to make a summary table that calculates the proportion of individuals who identify as Democrat, Republican, and Independent for each year, and I am completely lost. I'm sure this is pretty easy, and I have a good idea of how I would go about it in Python, but I must admit I am stumped when it comes to R. It just isn't very intuitive for me.
Is there a summary table function in R? How do I do this with the factors? Thanks!
Upvotes: 0
Views: 209
Reputation: 39613
Re-using your useful code, maybe you are looking for this:
library(dplyr)
library(forcats)  # provides fct_collapse() and the gss_cat data
#Code
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  # Drop the collapsed "other" level so only rep, ind, and dem remain
  filter(partyid != 'other') %>%
  ungroup() %>%
  # Regroup by year so that sum(n) is the yearly total
  group_by(year) %>%
  mutate(Prop = n / sum(n))
Output:
# A tibble: 24 x 4
# Groups:   year [8]
    year partyid     n  Prop
   <int> <fct>   <int> <dbl>
 1  2000 rep       684 0.248
 2  2000 ind      1152 0.418
 3  2000 dem       921 0.334
 4  2002 rep       764 0.285
 5  2002 ind       994 0.371
 6  2002 dem       923 0.344
 7  2004 rep       821 0.296
 8  2004 ind       991 0.358
 9  2004 dem       959 0.346
10  2006 rep      1132 0.256
# ... with 14 more rows
The ungroup() call can be avoided, because group_by() replaces any existing grouping by default, which is what .add = FALSE spells out here (many thanks and credit to @NotThatKindODr):
#Code 2
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  filter(partyid != 'other') %>%
  group_by(year, .add = FALSE) %>%
  mutate(Prop = n / sum(n))
Output:
# A tibble: 24 x 4
# Groups:   year [8]
    year partyid     n  Prop
   <int> <fct>   <int> <dbl>
 1  2000 rep       684 0.248
 2  2000 ind      1152 0.418
 3  2000 dem       921 0.334
 4  2002 rep       764 0.285
 5  2002 ind       994 0.371
 6  2002 dem       923 0.344
 7  2004 rep       821 0.296
 8  2004 ind       991 0.358
 9  2004 dem       959 0.346
10  2006 rep      1132 0.256
# ... with 14 more rows
Same output.
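If a one-row-per-year summary table is wanted, the long result can be spread into columns. The following is only a sketch under assumptions: it uses tidyr::pivot_wider(), which none of the code above uses, and the small `counts` tibble is a stand-in built from the 2000 and 2002 rows of the output above rather than the full gss_cat pipeline.

```r
library(dplyr)
library(tidyr)

# Stand-in data: counts copied from the 2000 and 2002 rows of the output above
counts <- tibble(
  year    = c(2000L, 2000L, 2000L, 2002L, 2002L, 2002L),
  partyid = factor(c("rep", "ind", "dem", "rep", "ind", "dem"),
                   levels = c("rep", "ind", "dem")),
  n       = c(684L, 1152L, 921L, 764L, 994L, 923L)
)

wide <- counts %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%   # share of each party within its year
  ungroup() %>%
  select(-n) %>%                  # drop counts so each year collapses to one row
  pivot_wider(names_from = partyid, values_from = prop)

wide
```

With the real data, the same last three steps could be appended to either pipeline above (using Prop instead of prop); each row of the result should then sum to 1 across the party columns.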
Upvotes: 2
Reputation: 719
Building on the code you already have:
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n))
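To see why the regrouping step yields per-year proportions, it may help to inspect the grouping on a tiny example. This is a minimal sketch: the `toy` tibble is made up for illustration, and `.groups = "drop_last"` (available in dplyr 1.0.0 and later) only makes the default behaviour of summarize() explicit.

```r
library(dplyr)

# Hypothetical toy data: three respondents in 2000, two in 2002
toy <- tibble(
  year    = c(2000, 2000, 2000, 2002, 2002),
  partyid = c("dem", "dem", "rep", "dem", "rep")
)

counted <- toy %>%
  group_by(year, partyid) %>%
  summarize(n = n(), .groups = "drop_last")  # peels off the last grouping variable

group_vars(counted)  # the result is still grouped, now by year only

props <- counted %>%
  group_by(year) %>%          # explicit regrouping, as in the answer above
  mutate(prop = n / sum(n))   # sum(n) is computed within each year

props
```

Because the grouping is by year when mutate() runs, sum(n) is the yearly total, so the proportions within each year add up to 1.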
Upvotes: 1
Reputation: 3888
Basically, you need to regroup by year so that you can get the proportions. There are two ways to do this: either n/sum(n) or prop.table(n). I multiplied by 100 to get percentages:
summed <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  # Drop the collapsed "other" level; the 24-row output below excludes it
  filter(partyid != "other")

# Option 1: prop.table()
summed %>% group_by(year) %>% mutate(ptg = prop.table(n) * 100)
# Option 2: n / sum(n)
summed %>% group_by(year) %>% mutate(ptg = n / sum(n) * 100)
# A tibble: 24 x 4
# Groups:   year [8]
    year partyid     n   ptg
   <int> <fct>   <int> <dbl>
 1  2000 rep       684  24.8
 2  2000 ind      1152  41.8
 3  2000 dem       921  33.4
 4  2002 rep       764  28.5
 5  2002 ind       994  37.1
 6  2002 dem       923  34.4
 7  2004 rep       821  29.6
 8  2004 ind       991  35.8
 9  2004 dem       959  34.6
10  2006 rep      1132  25.6
# … with 14 more rows
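The equivalence of the two approaches can be checked in base R on the year-2000 counts from the table above. This is just an illustrative check on a named vector, not part of the pipeline itself:

```r
# Counts for 2000, copied from the first three rows of the table above
n <- c(rep = 684, ind = 1152, dem = 921)

by_sum   <- n / sum(n) * 100      # manual proportion
by_table <- prop.table(n) * 100   # base R helper doing the same division

all.equal(by_sum, by_table)  # TRUE
round(by_table, 1)           # rep 24.8, ind 41.8, dem 33.4
```

Inside a grouped mutate(), both expressions are evaluated once per year, which is why either one gives the per-year percentages.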
Upvotes: 1