user3221541

Reputation: 11

R: creating subset of rows (based on column name) for further analysis

I am very new to R so this may be a silly question. Please bear with…

We have assessed participants' attention in our study. Each participant completed 365 trials in one of two conditions; we noted responses, accuracy, etc. Now, the first row of each column represents the headers for the above:

participant_id  trial  condition  accuracy  etc.
 101            1         0        1       ... 
 101            2         0        1       ...
 101            3         0        0       ...
 102            1         3        1       ...
 102            2         3        0       ...

I want to calculate the overall average accuracy for the first versus the last 120 trials. Note: of the 365 trials, the first five are for practice of the task only. Thus, I am looking to get the descriptives (mean, standard deviation etc.) for the overall accuracy on trials 6-125 (first 120) and 246-365 (last 120).

I have tried using the subset() command to split my data up, but am not sure it's the appropriate function. I am also uncertain about the best way to then calculate my means.

#split data.sub into first and last 120 trials

data.sub120=subset(data.sub, data.sub$trial== 6:125)
data.sub120last=subset (data.sub, data.sub$trial== 246:365)
stat.desc (data.sub120,data.sub120last)

Any help would be appreciated - sorry if I'm wasting anyone's time, still learning!

Thanks!

Upvotes: 1

Views: 410

Answers (4)

marbel

Reputation: 7714

Here is another solution, in line with Brandon's, using the data.table package. It's faster than plyr, and I find its syntax for aggregation problems more intuitive. Here is the documentation for further reference.

demo.data <- data.frame(participant.id = c(rep(101, 365), rep(102, 365), rep(103, 365)),
                        trial = c(1:365, 1:365, 1:365),
                        condition = letters[1:5],
                        accuracy = rbinom(365*3, 1, 0.5))

require("data.table")
DT <- data.table(demo.data)

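# cut() intervals are right-closed, so each break is the last trial of its bin:
# (0,5] practice, (5,125] first 120, (125,245] middle, (245,365] last 120
DT$fc_trial <- cut(DT$trial, breaks = c(0, 5, 125, 245, 365),
                   labels = c("Practice","First120","Middle","Last120"))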

result <- DT[,j=list(mean_accuracy = mean(accuracy),
                     sd_accuracy = sd(accuracy)
                     )
             , by = fc_trial]
print(result)

# Example output (the data are random, so your values will differ):
#    fc_trial mean_accuracy sd_accuracy
# 1: Practice 0.6000000   0.5070926
# 2: First120 0.5151515   0.5004602
# 3:   Middle 0.5833333   0.4936928
# 4:  Last120 0.4677871   0.4996615

Upvotes: 1

Brandon Bertelsen

Reputation: 44648

I find it good practice to create a variable that describes the subset and store it with my data for future use. You'll thank yourself later for being able to reproduce large parts of your analysis (bonus points to yourself for naming variables in a manner that has intrinsic meaning to you).

First, let's create a basic factor based on your criteria and append it to your dataset:

# cut() intervals are right-closed: (0,5], (5,125], (125,245], (245,365]
mydata$trialsplit <- cut(mydata$trial, breaks = c(0, 5, 125, 245, 365),
                         labels = c("Practice", "First120", "Middle", "Last120"))
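
Because cut() builds right-closed intervals by default, each break is the last trial number that belongs to its bin. A quick sanity check, run on the trial numbers 1 to 365 alone rather than on real data, is to tabulate the bins. With breaks at 0, 5, 125, 245 and 365 it should show 5 practice trials and 120 trials in each of the other bins:

```r
# Tabulate how many of the trial numbers 1..365 land in each bin.
# Default intervals are right-closed: (0,5], (5,125], (125,245], (245,365]
trial <- 1:365
bins <- cut(trial, breaks = c(0, 5, 125, 245, 365),
            labels = c("Practice", "First120", "Middle", "Last120"))
bin_counts <- table(bins)
print(bin_counts)

# First and last trial number that falls in each bin
bin_ranges <- tapply(trial, bins, range)
print(bin_ranges)
```

If a bin shows 119 or 121 trials, a break is off by one.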

I'm also a fan of the plyr package so I would use this in a manner similar to Maiasaura. If you just need a summary table, you can do the following:

library(plyr)  # ddply() is a function in the plyr package
ddply(mydata, .(trialsplit), summarize, 
      mean_condition = mean(condition),
      sd_condition = sd(condition),
      mean_accuracy = mean(accuracy),
      sd_accuracy = sd(accuracy)
)

If you'd like to append the information to your data instead of generating a summary, change the word "summarize" to "transform".
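
The same idea of appending a group statistic to every row can also be sketched in base R with ave(), in case plyr is not available. This is an alternative, not what ddply does internally, and the data below are simulated stand-ins for the real columns:

```r
set.seed(2)  # arbitrary seed, only so the simulated data are reproducible
mydata <- data.frame(trial = rep(1:365, times = 2),
                     accuracy = rbinom(365 * 2, 1, 0.5))
mydata$trialsplit <- cut(mydata$trial, breaks = c(0, 5, 125, 245, 365),
                         labels = c("Practice", "First120", "Middle", "Last120"))

# ave() returns one value per row: the group statistic repeated within each group
mydata$mean_accuracy <- ave(mydata$accuracy, mydata$trialsplit, FUN = mean)
mydata$sd_accuracy   <- ave(mydata$accuracy, mydata$trialsplit, FUN = sd)

head(mydata)
```

Every row keeps its original values and gains the mean and standard deviation of its own block.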

Once the cut variable is saved, running statistical tests on your data becomes quite easy as well:

# Does accuracy change from the first 120 to the last 120 trials?

t.test(mydata$accuracy[mydata$trialsplit == "First120"],
       mydata$accuracy[mydata$trialsplit == "Last120"])

Upvotes: 1

Carlos Cinelli

Reputation: 11597

You can subset with inequalities:

## creating data for demonstration purposes

demo.data <- data.frame(participant.id = c(rep(101, 365), rep(102, 365), rep(103, 365)),
                        trial = c(1:365, 1:365, 1:365),
                        accuracy = rbinom(365*3, 1, 0.5))

## getting the first 120 trials
data.sub120 <- demo.data[demo.data$trial>5 & demo.data$trial<126,]

##getting the last 120 trials
data.sub120last <- demo.data[demo.data$trial>245 & demo.data$trial<366,]

##taking the means
mean(data.sub120$accuracy)
mean(data.sub120last$accuracy)
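
The comparison `trial == 6:125` from the question does not test membership: `==` compares element by element and recycles the shorter vector, so only rows that happen to line up are kept. `%in%` tests membership instead. A minimal sketch on simulated data, which also returns the standard deviation the question asks for (the seed, `sim` and the `describe` helper are made up for illustration):

```r
set.seed(1)  # arbitrary seed, only so the simulated data are reproducible
sim <- data.frame(participant_id = rep(101:103, each = 365),
                  trial = rep(1:365, times = 3),
                  accuracy = rbinom(365 * 3, 1, 0.5))

# %in% keeps every row whose trial number is in the given range
first120 <- sim[sim$trial %in% 6:125, ]
last120  <- sim[sim$trial %in% 246:365, ]

# Hypothetical helper returning n, mean and sd of a numeric vector
describe <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))

descriptives <- rbind(first120 = describe(first120$accuracy),
                      last120  = describe(last120$accuracy))
print(descriptives)
```

With three simulated participants, each subset should contain 360 rows (120 trials per participant).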

Upvotes: 1

Maiasaura

Reputation: 32986

library(plyr)

# ddply takes a data.frame, splits by a variable, applies a fn,
# and returns everything back to a data.frame
results <- ddply(data.sub, .(participant_id), function(x) {
     # order the data by trial number
     x <- arrange(x, trial)
     # Take rows 6-125 (the first 120 non-practice trials), and only columns 3 and 4 
     # since they are the only numeric ones in your example above, 
     # and apply the mean function to each column
     # turn that into a data.frame
     result <- data.frame(t(apply(x[6:125, c(3,4)], 2, mean)))
     # add the participant ID
     result$participant_id <- unique(x$participant_id)
     result
    })
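
If per-participant means for both blocks are wanted in one table, a base-R sketch with aggregate() is one option. It is an alternative to the ddply call, shown on simulated data; `sim` and the `block` column are made-up names:

```r
set.seed(3)  # arbitrary seed, only so the simulated data are reproducible
sim <- data.frame(participant_id = rep(101:103, each = 365),
                  trial = rep(1:365, times = 3),
                  accuracy = rbinom(365 * 3, 1, 0.5))

# Label the two blocks of interest; all other trials stay NA
sim$block <- NA_character_
sim$block[sim$trial %in% 6:125]   <- "first120"
sim$block[sim$trial %in% 246:365] <- "last120"

# Mean accuracy per participant and block.
# The formula interface drops rows where block is NA (na.action = na.omit).
per_participant <- aggregate(accuracy ~ participant_id + block,
                             data = sim, FUN = mean)
print(per_participant)
```

The result has one row per participant and block, which is a convenient shape for a paired comparison later on.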

Upvotes: 1
