James Todd
James Todd

Reputation: 123

Aggregate dataframe into a frequency table

I am looking to reshape a dataframe from something which looks like this, with variables:

Year, University, Degree, Gender 

With each row depicting an entry of a student, eg:

2017, University College London, Social Science, Male 

2017, University of Leeds, Social science, Non-Binary 

I would like to create a frequency table from this data in order to condense the number of rows so that for each university, there are 19 rows for each of the degree categories, and then for each Degree, the count/frequency for each gender is shown, which would look something like this.

Year University Degree [Gender (Male, Female, Non-Binary)]

2017 UCL Biological Sciences 1 0 2  

I hope this makes sense. Thank you for the help.

EDIT: I would now like to be able to plot this data as a line graph using a subset of the data. I am currently subsetting outside of the plotting function like so

   subsetucl <- TFtab[which(TFtab$University == 'University College London'),]
ggplot(data=subsetucl, aes(x=Year, y=Female, group=Degree, color = Degree)) + geom_line()+ geom_point(size = 0.8) + xlab("Year of application") + ylab("Frequnecy of Females") + ggtitle("UCL Applications by Degree (2011-2017)") + theme_bw()

What is the best way to subset the data within the plotting function and how do I best display lines for all Genders rather than just female frequencies. Thank you

Upvotes: 1

Views: 2482

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269371

1) aggregate/model.matrix Try this one-line aggregate solution. No packages are used.

aggregate(model.matrix(~ Gender + 0) ~ Year + University + Degree, DF, sum)

giving:

  Year                University         Degree GenderFemale GenderMale GenderNon-Binary
1 2017       University of Leeds Social science            1          0                1
2 2017 University College London Social Science            0          1                0

2) aggregate/cbind It would also be possible to write out the model.matrix(...) part using cbind(...) like this which may be clearer although tedious:

aggregate(cbind(Female = Gender == "Female", Male = Gender == "Male",
            `Non-Binary` = Gender == "Non-Binary") ~ Year + University + Degree, DF, sum)

giving the following which is the same as above except for a slight variation in the column names:

  Year                University         Degree Female Male Non-Binary
1 2017       University of Leeds Social science      1    0          1
2 2017 University College London Social Science      0    1          0

Note: The input used in the example above in reproducible form is:

Lines <- "Year, University, Degree, Gender 
2017, University College London, Social Science, Male 
2017, University of Leeds, Social science, Non-Binary
2017, University of Leeds, Social science, Female"
DF <- read.csv(text = Lines, strip.white = TRUE)

Upvotes: 0

svenhalvorson
svenhalvorson

Reputation: 1080

Heres a very solution with dplyr.

library("dplyr")
data %>%
   group_by(University, Degree, Gender) %>%
   count( )%>% 
   spread(key = Gender, value = n, fill = 0)

But seriously use the search function on stack overflow. Here's a book to help with R

Upvotes: 2

Related Questions