I'm trying to do the equivalent of a SQL group-by summary in R using the ddply function from plyr. I have a data frame with three columns (say id, period and event). I'd like to count how many times each id appears in the data frame (count(*) ... group by id in SQL) and get the last value of the event column for each id.
Here is an example of what I have and what I'm trying to obtain:
id period event   # original data frame
1  1      1
2  1      0
2  2      1
3  1      1
4  1      1
4  1      0

id t x   # what I want to obtain
1  1 1
2  2 1
3  1 1
4  2 0
This is the simple code I've been using for that:
library(plyr)
teachers.pp <- read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep = ",", header = TRUE)  # whole data frame
datos <- ddply(teachers.pp, .(id), function(x) c(t = length(x$id), x = x[length(x$id), 3]))  # this works fine
Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example using syntax equivalent to the one below:
datos2 <- ddply(teachers.pp, .(id), summarise, t = length(id), x = teachers.pp[length(id), 3])  # using summarise, but the result is not what I want
This is the data frame I get in datos2:
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get with the first piece of code (datos)? What am I doing wrong?
It is not clear to me when I should use summarise or transform. Could you tell me the correct syntax for the ddply function?
Reputation: 173737
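When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names. Inside summarise, id and event refer only to the rows of the current group, whereas teachers.pp[length(id),3] indexes the full data frame by the group's size, so it returns the event value from an unrelated row.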
You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
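To see the difference on something small, here is a minimal sketch that rebuilds the six-row example table from the question instead of downloading the CSV. The names toy, by_function, by_summarise and by_transform are made up for illustration, and the sketch assumes the plyr package is installed.

```r
library(plyr)  # assumes plyr is installed

# Toy data copied from the question's example table (hypothetical name: toy)
toy <- data.frame(
  id     = c(1, 2, 2, 3, 4, 4),
  period = c(1, 1, 2, 1, 1, 1),
  event  = c(1, 0, 1, 1, 1, 0)
)

# Anonymous-function version: x is the subset of rows for one id
by_function <- ddply(toy, .(id), function(x) c(t = nrow(x), x = x$event[nrow(x)]))

# summarise version: id and event are the columns of the current group only
by_summarise <- ddply(toy, .(id), summarise, t = length(id), x = tail(event, 1))

# Why indexing the full data frame goes wrong, shown for id 2 (group size 2):
wrong_for_id2 <- toy[2, 3]                        # row 2 of the WHOLE data frame
right_for_id2 <- tail(toy$event[toy$id == 2], 1)  # last event within the group

# transform, by contrast, keeps every row and adds the per-group value as a column
by_transform <- ddply(toy, .(id), transform, t = length(id))

print(by_summarise)
```

The two grouped summaries should agree with each other and with the table the question asks for. As a rule of thumb, summarise collapses each group to one row, while transform returns the original rows with extra columns, so summarise is the one that matches a SQL group by.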
Upvotes: 5