Perform multiple summary functions and return a dataframe

Question

I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:

students <- read.table(text = "zip  school  score
              43050 'Hunter'    202.72974236
              48227 'NYU'   338.49571519
              48227 'NYU'   223.48658339
              32566 'CCNY'  310.40666224
              78596 'Columbia'  821.59318662
              78045 'Columbia'  853.09842034
              60651 'Lang'  277.48624384
              32566 'Lang'  315.49753763
              32566 'Lang'  80.296556533
              94941 'LIU'   373.53839238
              ",header = TRUE,sep = "")

I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:

tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))

Giving them better column names:

colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")

And using cbind to tie them all back together again:

school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)

I think the cleaner way to do this is:

library(plyr)
school.stats <- ddply(students, .(school), summarise, 
                     record.count=length(score),  
                     unique.r.zips=length(unique(zip)),
                     mean.dist=mean(score), 
                     total.dist=sum(score)
)

The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there better way to find out how many records there are associated with each school? And, am I using ddply efficiently here? I'm new to it.

Amanda · Accepted Answer

Comments seem to be in general agreement: this looks good.

Perform multiple summary functions and return a dataframe

Answers (2)

Related Questions