Doubts about ddply function in R

Question

I'm trying to do the equivalent of a group-by summary in R with plyr's ddply function. I have a data frame with three columns (say id, period and event). I'd like to count how many times each id appears in the data frame (like count(*) ... group by id in SQL) and get the last value of the event column for each id.

Here is an example of what I have and what I'm trying to obtain:

  id period event #original data frame
  1      1     1
  2      1     0
  2      2     1
  3      1     1
  4      1     1
  4      1     0

  id  t  x #what I want to obtain
  1   1  1
  2   2  1
  3   1  1
  4   2  0

This is the simple code I've been using for that:

 library(plyr)
 teachers.pp <- read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=TRUE) # whole data frame
 datos1 <- ddply(teachers.pp, .(id), function(x) c(t=length(x$id), x=x[length(x$id),3])) # This is working fine.

Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example using syntax equivalent to the one below:

  datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise but the result is not what I want.

This is the data frame I get using datos2

So, my question is: why is this result different from the one I get with the first piece of code (datos1)? What am I doing wrong?

It is not clear to me when I should use summarise and when I should use transform. Could you tell me the correct syntax for the ddply function?

joran · Accepted Answer

When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names.

You tried this:

ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])

when what you probably wanted was something more like this:

ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
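To see why the two calls differ, here is a minimal sketch on the six-row example from the question rather than the full teachers_pp.csv file. The names toy, wrong, right and kept are made up for illustration. As far as I can tell, the original attempt goes wrong because teachers.pp[length(id),3] indexes the whole data frame, so a group with n rows picks up event from row n of the full data rather than from the group's own last row.

```r
library(plyr)

# Toy data copied from the example in the question (not the real file)
toy <- data.frame(
  id     = c(1, 2, 2, 3, 4, 4),
  period = c(1, 1, 2, 1, 1, 1),
  event  = c(1, 0, 1, 1, 1, 0)
)

# Original attempt: toy[length(id), 3] looks at the FULL data frame,
# so every group of size n returns event from row n of toy.
wrong <- ddply(toy, .(id), summarise, t = length(id), x = toy[length(id), 3])

# Corrected: inside summarise, `event` already refers to the current group.
right <- ddply(toy, .(id), summarise, t = length(id), x = tail(event, 1))

# For comparison: transform keeps every row and adds a column per group.
kept <- ddply(toy, .(id), transform, n = length(id))

print(wrong)
print(right)
print(kept)
```

On the summarise versus transform part of the question: summarise collapses each group to a single row built from the expressions you give it, while transform returns all of the group's original rows with the new columns attached. For a count(*) ... group by style result, summarise is the one that fits.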
