bismo

Reputation: 1439

Summary Tables in R

I am a Python user trying to pick up R to become more versatile, so I decided to work through the book R for Data Science. I have hit a challenge in chapter 15, which covers factors. R seems to handle categorical variables quite differently from Python. Using the gss_cat data (which ships with the forcats package), I was able to make line plots showing how the numbers of Democrats, Republicans, and Independents change over time (thanks to this helpful resource).

library(tidyverse)  # loads dplyr, forcats (gss_cat, fct_collapse) and ggplot2

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  ggplot(mapping = aes(x = year, y = n, color = fct_reorder2(partyid, year, n))) +
  geom_point() +
  geom_line() +
  ggtitle("Proportions of Democrat, Republican, and Independent Over Time") +
  labs(color = 'Party', x = 'Year', y = 'Count') +
  # theme_minimal() must come before theme(); otherwise it resets the centered title
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

However, one of the next questions asks for a summary table with the proportion of individuals who identify as Democrat, Republican, and Independent in each year, and I am completely lost. I'm sure this is pretty easy, and I have a good idea of how I would go about it in Python, but I must admit I am stumped when it comes to R. It just isn't very intuitive for me.

Is there a summary table function in R? How do I do this with the factors? Thanks!

Upvotes: 0

Views: 209

Answers (3)

Duck

Reputation: 39613

Re-using your useful code, maybe you are looking for this:

library(dplyr)
library(forcats)  # provides gss_cat and fct_collapse()
#Code
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
                                other = c("No answer", "Don't know", "Other party"),
                                rep = c("Strong republican", "Not str republican"),
                                ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                                dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  filter(partyid != 'other') %>%
  ungroup() %>%
  group_by(year) %>%
  mutate(Prop = n / sum(n))

Output:

# A tibble: 24 x 4
# Groups:   year [8]
    year partyid     n  Prop
   <int> <fct>   <int> <dbl>
 1  2000 rep       684 0.248
 2  2000 ind      1152 0.418
 3  2000 dem       921 0.334
 4  2002 rep       764 0.285
 5  2002 ind       994 0.371
 6  2002 dem       923 0.344
 7  2004 rep       821 0.296
 8  2004 ind       991 0.358
 9  2004 dem       959 0.346
10  2006 rep      1132 0.256
# ... with 14 more rows

The ungroup() call can be dropped, because group_by() replaces any existing grouping by default (.add = FALSE). Many thanks and credit to @NotThatKindODr:

#Code 2
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
                                other = c("No answer", "Don't know", "Other party"),
                                rep = c("Strong republican", "Not str republican"),
                                ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                                dem = c("Not str democrat", "Strong democrat"))) %>%
  group_by(year, partyid) %>%
  summarize(n = n()) %>%
  filter(partyid != 'other') %>%
  group_by(year, .add = FALSE) %>%
  mutate(Prop = n / sum(n))

Output:

# A tibble: 24 x 4
# Groups:   year [8]
    year partyid     n  Prop
   <int> <fct>   <int> <dbl>
 1  2000 rep       684 0.248
 2  2000 ind      1152 0.418
 3  2000 dem       921 0.334
 4  2002 rep       764 0.285
 5  2002 ind       994 0.371
 6  2002 dem       923 0.344
 7  2004 rep       821 0.296
 8  2004 ind       991 0.358
 9  2004 dem       959 0.346
10  2006 rep      1132 0.256
# ... with 14 more rows

Same output.
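
If a table with one row per year is easier to read, the long result can be spread into columns. This is only a minimal sketch, assuming the tidyr package is installed alongside dplyr and forcats. The name party_wide is just illustrative, and count() is used as shorthand for group_by() plus summarize(n = n()):

```r
library(dplyr)
library(tidyr)
library(forcats)

party_wide <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
                                other = c("No answer", "Don't know", "Other party"),
                                rep = c("Strong republican", "Not str republican"),
                                ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                                dem = c("Not str democrat", "Strong democrat"))) %>%
  filter(partyid != "other") %>%          # keep only rep, ind and dem
  count(year, partyid) %>%                # one row per year/party with count n
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%           # share of each party within its year
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = partyid, values_from = prop)  # one column per party

party_wide
```

Each row of party_wide then holds the three proportions for a single year, and they should sum to 1.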

Upvotes: 2

NotThatKindODr

Reputation: 719

Using the code you have already built, regroup by year and divide each count by the yearly total:

gss_cat %>%
   mutate(partyid = fct_collapse(partyid,
                                 other = c("No answer", "Don't know", "Other party"),
                                 rep = c("Strong republican", "Not str republican"),
                                 ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                                 dem = c("Not str democrat", "Strong democrat"))) %>%
   group_by(year, partyid) %>%
   summarize(n = n()) %>%
   group_by(year) %>%
   mutate(prop = n / sum(n))
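
One detail that may help: summarize() drops only the last grouping variable, so the counts are still grouped by year when they come out, and the extra group_by(year) is harmless rather than strictly required. Below is a small sketch of that behaviour, assuming dplyr 1.0 or later (for the .groups argument) and using the uncollapsed partyid levels to keep it short. The names counts and props are illustrative:

```r
library(dplyr)
library(forcats)  # provides gss_cat

counts <- gss_cat %>%
  group_by(year, partyid) %>%
  summarize(n = n(), .groups = "drop_last")  # explicit form of the default behaviour

group_vars(counts)  # the remaining grouping variable is year

# Because the data are still grouped by year, sum(n) is the yearly total
props <- counts %>%
  mutate(prop = n / sum(n))
```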
                                  

Upvotes: 1

Abdessabour Mtk

Reputation: 3888

Basically, you need to re-group by year so that the proportions are computed within each year. There are two ways to do this: n/sum(n) or prop.table(n). I multiplied by 100 to get percentages:

summed <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  filter(partyid != "other") %>%  # drop "other" so the result matches the output shown
  group_by(year, partyid) %>%
  summarize(n = n())

# Two equivalent ways to compute the percentage within each year
summed %>% group_by(year) %>% mutate(ptg = prop.table(n) * 100)
summed %>% group_by(year) %>% mutate(ptg = n / sum(n) * 100)
# A tibble: 24 x 4
# Groups:   year [8]
    year partyid     n   ptg
   <int> <fct>   <int> <dbl>
 1  2000 rep       684  24.8
 2  2000 ind      1152  41.8
 3  2000 dem       921  33.4
 4  2002 rep       764  28.5
 5  2002 ind       994  37.1
 6  2002 dem       923  34.4
 7  2004 rep       821  29.6
 8  2004 ind       991  35.8
 9  2004 dem       959  34.6
10  2006 rep      1132  25.6
# … with 14 more rows
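
For comparison, base R can build the same kind of table without dplyr: table() cross-tabulates two variables, and prop.table() with margin = 1 turns each row into proportions. The sketch below uses a tiny made-up data frame (survey and its values are hypothetical, not taken from gss_cat) so that it runs without any packages:

```r
# Hypothetical toy data standing in for gss_cat; the values are invented
survey <- data.frame(
  year    = c(2000, 2000, 2000, 2000, 2002, 2002, 2002, 2002),
  partyid = c("dem", "rep", "ind", "dem", "rep", "rep", "ind", "dem")
)

# Cross-tabulate: one row per year, one column per party
counts <- table(survey$year, survey$partyid)

# margin = 1 divides each cell by its row total, giving proportions within a year
props <- prop.table(counts, margin = 1)
props

# On a plain vector of counts, prop.table(n) is the same as n / sum(n)
n <- c(2, 1, 1)
same <- isTRUE(all.equal(prop.table(n), n / sum(n)))
```

Multiplying props by 100 gives percentages, as in the dplyr version.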

Upvotes: 1
