Reputation: 71
I'm using the ddply function (plyr) to calculate something separately by participant id (pid). However, for some reason it's not returning separate values by pid, but rather the same value across all pid.
Sample data:
sdt<-c("Hit","Hit","Miss","Miss","False Alarm","Correct Reject","Correct Reject","Correct Reject",
"Hit","Hit","Hit","Miss","False Alarm","False Alarm","False ALarm","Correct Reject")
pid<-c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
adhd_p<-data.frame(sdt,pid)
Function:
ddply(adhd_p, "pid", summarise,
hitrate=(count(adhd_p$sdt=="Hit")[[2,2]])/((count(adhd_p$sdt=="Hit")[[2,2]])+(count(adhd_p$sdt=="Miss")[[2,2]])),
falsealarmrate=(count(adhd_p$sdt=="False Alarm")[[2,2]])/((count(adhd_p$sdt=="False Alarm")[[2,2]])+(count(adhd_p$sdt=="Correct Reject")[[2,2]])))
If it helps to understand what I'm calculating: participants can either "Hit" (respond affirmatively to target), "Miss" (do not respond to target), "Correct Reject" (do not respond to distractor), or "False Alarm" (respond affirmatively to distractor). Thus, "hitrate" is hits / (hits + misses), and "falsealarmrate" is false alarms / (false alarms + correct rejects).
What am I doing wrong?
Thanks for your time.
Edit: Above problem solved very quickly by editing code to
ddply(adhd_p, "pid", summarise,
hitrate=(count(sdt=="Hit")[[2,2]])/((count(sdt=="Hit")[[2,2]])+(count(sdt=="Miss")[[2,2]])),
falsealarmrate=(count(sdt=="False Alarm")[[2,2]])/((count(sdt=="False Alarm")[[2,2]])+(count(sdt=="Correct Reject")[[2,2]])))
I realize now that I need to split over two variables rather than just one. However, adding a time variable:
time<-c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8)
And merging it in with the others
adhd_p<-data.frame(sdt,pid,time)
Makes the new script produce a "subscript out of bounds" error.
ddply(adhd_p, .(pid,time), summarise,
hitrate=(count(sdt=="Hit")[[2,2]])/((count(sdt=="Hit")[[2,2]])+(count(sdt=="Miss")[[2,2]])),
falsealarmrate=(count(sdt=="False Alarm")[[2,2]])/((count(sdt=="False Alarm")[[2,2]])+(count(sdt=="Correct Reject")[[2,2]])))
Any thoughts?
Upvotes: 1
Views: 872
Reputation: 3991
What you need to be doing:
ddply(adhd_p, "pid", summarise,
hitrate=(count(sdt=="Hit")[[2,2]])/((count(sdt=="Hit")[[2,2]])+(count(sdt=="Miss")[[2,2]])),
falsealarmrate=(count(sdt=="False Alarm")[[2,2]])/((count(sdt=="False Alarm")[[2,2]])+(count(sdt=="Correct Reject")[[2,2]])))
Why you need to be doing it:
When you call ddply, the function works within the .data argument (adhd_p in your case) as the local namespace. This is similar to calling attach(adhd_p): referring to a column by name, without referencing the data frame explicitly, still finds the correct column.
When you supply summarise as the function, ddply splits the data frame by the id columns supplied (in this case, pid) and evaluates your expressions within each piece. So, if you reference columns without naming the data frame, as above, the calculation uses only the portion of the sdt column corresponding to each pid. However, if you reference the data frame and column explicitly (adhd_p$sdt in your case), R pulls in the entire vector from the global environment, which is never split, so every group gets the same value.
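To make the difference concrete, here is a minimal sketch that computes the same count both ways in one call. It assumes plyr is installed and that the code runs at the top level of a session, so that adhd_p is visible in the global environment. The data are copied verbatim from the question; the column names whole_column and per_group are made up for illustration.

```r
library(plyr)

# Sample data copied verbatim from the question
adhd_p <- data.frame(
  sdt = c("Hit", "Hit", "Miss", "Miss", "False Alarm",
          "Correct Reject", "Correct Reject", "Correct Reject",
          "Hit", "Hit", "Hit", "Miss", "False Alarm",
          "False Alarm", "False ALarm", "Correct Reject"),
  pid = rep(c(1, 2), each = 8)
)

res <- ddply(adhd_p, "pid", summarise,
             # adhd_p$sdt is the full, unsplit column: same total in every group
             whole_column = sum(adhd_p$sdt == "Hit"),
             # bare sdt is only the rows belonging to the current pid
             per_group = sum(sdt == "Hit"))
print(res)
```

Here whole_column comes out as 5 for both participants (all hits in the data), while per_group is 2 for pid 1 and 3 for pid 2.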
Edit: the code below is both less messy and won't raise an error if one of the values is missing:
ddply(adhd_p, .(pid, time), summarise,
hitrate=sum(sdt=="Hit")/(sum(sdt=="Hit")+sum(sdt=="Miss")),
falsealarmrate=sum(sdt=="False Alarm")/(sum(sdt=="False Alarm")+sum(sdt=="Correct Reject")))
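As a rough illustration of why the count(...)[[2,2]] approach breaks on small groups while sum() does not, the sketch below compares the two on a made-up vector with both outcomes and on a single-value vector, which is what each group looks like once you split by pid and time. The variable names are hypothetical; it assumes plyr is installed.

```r
library(plyr)

both   <- c("Hit", "Miss", "Hit")  # the comparison yields both TRUE and FALSE
single <- c("Miss")                # one-row group, e.g. one trial per pid/time

# count() returns one row per distinct value, ordered FALSE then TRUE,
# so [[2, 2]] is the TRUE frequency only when both values are present.
two_rows <- count(both == "Hit")[[2, 2]]

# With a single distinct value there is no second row, so indexing fails.
one_row <- tryCatch(count(single == "Hit")[[2, 2]],
                    error = function(e) conditionMessage(e))

# sum() on a logical vector counts the TRUEs and returns 0 when there are none.
hits_both   <- sum(both == "Hit")
hits_single <- sum(single == "Hit")

print(two_rows)
print(one_row)
print(c(hits_both, hits_single))
```

Note that with one row per group the denominator can be zero (for example, a "False Alarm" trial has no hits and no misses), in which case R returns NaN for that rate instead of raising an error.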
Upvotes: 2
Reputation: 52637
I haven't delved into why what you are doing is wrong, but here is an answer that might help:
ddply(
adhd_p, "pid", summarize,
hitrate=sum(sdt == "Hit") / sum(sdt %in% c("Hit", "Miss")),
falsealarmrate=sum(sdt == "False Alarm") / sum(sdt %in% c("False Alarm", "Correct Reject"))
)
Produces:
pid hitrate falsealarmrate
1 1 0.50 0.2500000
2 2 0.75 0.6666667
Upvotes: 1