Reputation: 123
I am looking to reshape a dataframe from something which looks like this, with variables:
Year, University, Degree, Gender
With each row depicting an entry of a student, eg:
2017, University College London, Social Science, Male
2017, University of Leeds, Social science, Non-Binary
I would like to create a frequency table from this data in order to condense the number of rows so that for each university, there are 19 rows for each of the degree categories, and then for each Degree, the count/frequency for each gender is shown, which would look something like this.
Year University Degree [Gender (Male, Female, Non-Binary)]
2017 UCL Biological Sciences 1 0 2
I hope this makes sense. Thank you for the help.
EDIT: I would now like to be able to plot this data as a line graph using a subset of the data. I am currently subsetting outside of the plotting function like so
subsetucl <- TFtab[which(TFtab$University == 'University College London'),]
ggplot(data=subsetucl, aes(x=Year, y=Female, group=Degree, color = Degree)) + geom_line()+ geom_point(size = 0.8) + xlab("Year of application") + ylab("Frequnecy of Females") + ggtitle("UCL Applications by Degree (2011-2017)") + theme_bw()
What is the best way to subset the data within the plotting function and how do I best display lines for all Genders rather than just female frequencies. Thank you
Upvotes: 1
Views: 2482
Reputation: 269371
1) aggregate/model.matrix Try this one-line aggregate solution. No packages are used.
aggregate(model.matrix(~ Gender + 0) ~ Year + University + Degree, DF, sum)
giving:
Year University Degree GenderFemale GenderMale GenderNon-Binary
1 2017 University of Leeds Social science 1 0 1
2 2017 University College London Social Science 0 1 0
2) aggregate/cbind It would also be possible to write out the model.matrix(...)
part using cbind(...)
like this which may be clearer although tedious:
aggregate(cbind(Female = Gender == "Female", Male = Gender == "Male",
`Non-Binary` = Gender == "Non-Binary") ~ Year + University + Degree, DF, sum)
giving the following which is the same as above except for a slight variation in the column names:
Year University Degree Female Male Non-Binary
1 2017 University of Leeds Social science 1 0 1
2 2017 University College London Social Science 0 1 0
Note: The input used in the example above in reproducible form is:
Lines <- "Year, University, Degree, Gender
2017, University College London, Social Science, Male
2017, University of Leeds, Social science, Non-Binary
2017, University of Leeds, Social science, Female"
DF <- read.csv(text = Lines, strip.white = TRUE)
Upvotes: 0
Reputation: 1080
Heres a very solution with dplyr.
library("dplyr")
data %>%
group_by(University, Degree, Gender) %>%
count( )%>%
spread(key = Gender, value = n, fill = 0)
But seriously use the search function on stack overflow. Here's a book to help with R
Upvotes: 2