Amanda
Amanda

Reputation: 12747

Perform multiple summary functions and return a dataframe

I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:

students <- read.table(text = "zip  school  score
              43050 'Hunter'    202.72974236
              48227 'NYU'   338.49571519
              48227 'NYU'   223.48658339
              32566 'CCNY'  310.40666224
              78596 'Columbia'  821.59318662
              78045 'Columbia'  853.09842034
              60651 'Lang'  277.48624384
              32566 'Lang'  315.49753763
              32566 'Lang'  80.296556533
              94941 'LIU'   373.53839238
              ",header = TRUE,sep = "")

I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:

tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))

Giving them better column names:

colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")

And using cbind to tie them all back together again:

school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)

I think the cleaner way to do this is:

library(plyr)
school.stats <- ddply(students, .(school), summarise, 
                     record.count=length(score),  
                     unique.r.zips=length(unique(zip)),
                     mean.dist=mean(score), 
                     total.dist=sum(score)
)

The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there better way to find out how many records there are associated with each school? And, am I using ddply efficiently here? I'm new to it.

Upvotes: 2

Views: 170

Answers (2)

Amanda
Amanda

Reputation: 12747

Comments seem to be in general agreement: this looks good.

Upvotes: 0

Troy
Troy

Reputation: 8691

If performance is an issue, you can also use data.table

require(data.table)
tab_s<-data.table(students)
setkey(tab_s,school)
tab_s[,list(total=sum(score),
        avg=mean(score),
        unique.zips=length(unique(zip)),
        records=length(score)),
   by="school"]

     school     total      avg unique.zips records
1:   Hunter  202.7297 202.7297           1       1
2:      NYU  561.9823 280.9911           1       2
3:     CCNY  310.4067 310.4067           1       1
4: Columbia 1674.6916 837.3458           2       2
5:     Lang  673.2803 224.4268           2       3
6:      LIU  373.5384 373.5384           1       1

Upvotes: 3

Related Questions