user20412

Reputation: 193

r: a case where there seems to be no alternative to a loop

I have a dataset where I have time series for many trials. On each trial, a participant might look at the target picture (trg), a competitor (cmp) or a distractor. Trials are different lengths. This little code snippet creates a sample time series.

sbj <- c(rep("s1",6),rep("s2",8))
trial <- c(rep(1,4),rep(2,2),rep(1,3),rep(2,5))
trg <- c(rep(0,3),1,0,1,c(rep(0,2),1,0,0,0,1,1))
cmp <- c(rep(0,3),0,1,0,c(rep(0,2),0,0,0,1,0,0))
dis <- c(rep(1,3),0,0,0,c(rep(1,2),0,1,1,0,0,0))
time<-c(seq(1,4),seq(1,2),seq(1,3),seq(1,5))
df<-data.frame(sbj,trial,time,trg,cmp,dis)
df

Data frame looks like this:

#   sbj trial time trg cmp dis
1   s1     1    1   0   0   1
2   s1     1    2   0   0   1
3   s1     1    3   0   0   1
4   s1     1    4   1   0   0
5   s1     2    1   0   1   0
6   s1     2    2   1   0   0
7   s2     1    1   0   0   1
8   s2     1    2   0   0   1
9   s2     1    3   1   0   0
10  s2     2    1   0   0   1
11  s2     2    2   0   0   1
12  s2     2    3   0   1   0
13  s2     2    4   1   0   0
14  s2     2    5   1   0   0

Now what I want to do is create records where the values for trg, cmp, and dis are their sums per trial within subjects (how many frames the participant looked at each), and another set where this is converted to the proportion of time steps that each object was looked at. For example, for the first subject's first trial, there are 4 time steps. The target is fixated for 1 time step, so its sum would be 1, and its proportion would be 0.25. The results I am looking for would be like this for sums:

#  sbj trial trgSum cmpSum disSum
1  s1     1      1      0      3
2  s1     2      1      1      0
3  s2     1      1      0      2
4  s2     2      2      1      2

And like this for proportions:

#  sbj trial trgProp cmpProp disProp
1  s1     1    0.25     0.0    0.75
2  s1     2    0.50     0.5    0.00
3  s2     1    0.33     0.0    0.67
4  s2     2    0.40     0.2    0.40

This is easy enough to achieve by looping through all unique combinations of subject and trial. But in the real dataset, there are hundreds of time steps per trial for hundreds of trials for dozens of subjects, so looping takes a very long time. Can anyone suggest a way to do this that avoids loops?

Thank you!

** EDIT ** I have a follow-up question, which reveals my weak R skills. The actual data frame has some additional factors. For example, if we modify the df to have a couple other factors:

grp <- c(rep("g1",6), rep("g2",8))
cnd <- c(rep("c1",4),rep("c2",2),rep("c1",3),rep("c4",5))
#
sbj <- c(rep("s1",6),rep("s2",8))
trial <- c(rep(1,4),rep(2,2),rep(1,3),rep(2,5))
trg <- c(rep(0,3),1,0,1,c(rep(0,2),1,0,0,0,1,1))
cmp <- c(rep(0,3),0,1,0,c(rep(0,2),0,0,0,1,0,0))
dis <- c(rep(1,3),0,0,0,c(rep(1,2),0,1,1,0,0,0))
time<-c(seq(1,4),seq(1,2),seq(1,3),seq(1,5))
df<-data.frame(sbj,grp,cnd,trial,time,trg,cmp,dis)
df

The aggregate and dplyr approaches either hit errors because the df contains factors, or apply a form of 'sum' to those variables that doesn't make sense. The data.table solution works, but drops the grp and cnd columns. Is there a way to make it work and then merge the result back with the appropriate grp and cnd values?

Thanks!

Upvotes: 2

Views: 83

Answers (4)

Parfait

Reputation: 107577

Consider using the aggregate() function (no looping needed) and creating the proportions by dividing one aggregated data frame by the other:

# SUM AND ROW COUNT PER SUBJECT AND TRIAL
sumdf <- aggregate(. ~ sbj + trial, df, FUN = sum)
lendf <- aggregate(. ~ sbj + trial, df, FUN = length)

# DIVIDE NUMERIC COLUMNS (4:6 = trg, cmp, dis) FROM BOTH DFS
propdf <- cbind(sumdf[, 1:2],
                round(sumdf[, 4:6] / lendf[, 4:6], 2))
# ORDER BY SBJ, TRIAL
propdf <- propdf[with(propdf, order(sbj, trial)), ]
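
For the data frame in the question's EDIT, one possible adaptation is to name the numeric columns explicitly with cbind() and add the extra factors to the grouping side of the formula, so that sum is never applied to grp or cnd. This is only a sketch: the name df2 is made up here, and it assumes grp and cnd are constant within each subject/trial combination (as they are in the sample data). Because trg, cmp and dis are 0/1 indicators, mean gives the proportion directly.

```r
# Sample data from the question's EDIT (df2 is a hypothetical name)
df2 <- data.frame(
  sbj   = c(rep("s1", 6), rep("s2", 8)),
  grp   = c(rep("g1", 6), rep("g2", 8)),
  cnd   = c(rep("c1", 4), rep("c2", 2), rep("c1", 3), rep("c4", 5)),
  trial = c(rep(1, 4), rep(2, 2), rep(1, 3), rep(2, 5)),
  time  = c(1:4, 1:2, 1:3, 1:5),
  trg   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1),
  cmp   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  dis   = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0)
)

# Only trg, cmp and dis are aggregated; the factors act as grouping keys
sum_df  <- aggregate(cbind(trg, cmp, dis) ~ sbj + grp + cnd + trial,
                     data = df2, FUN = sum)

# Mean of a 0/1 indicator = proportion of time steps
prop_df <- aggregate(cbind(trg, cmp, dis) ~ sbj + grp + cnd + trial,
                     data = df2, FUN = mean)
```

If a factor could vary within a trial, this grouping would split that trial into several rows, so check that assumption on the real data first.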

Upvotes: 2

Heroka

Reputation: 13139

For completeness, here is how you could do this in data.table:

library(data.table)

setDT(df)

dat_sums <- df[,lapply(.SD,sum), by = c("sbj","trial"),.SDcols=c("trg","cmp","dis")]

dat_props <- df[,lapply(.SD,function(x){sum(x)/length(x)}), by=c("sbj","trial"), .SDcols=c("trg","cmp","dis")]
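
Regarding the follow-up about the dropped grp and cnd columns, one option is to build a small lookup table of the factor values per subject/trial and merge it back onto the summary. The following is a sketch under the assumption that grp and cnd do not change within a subject/trial combination; the names dt, lookup and dt_sums_full are only illustrative.

```r
library(data.table)

# Sample data from the question's EDIT
dt <- data.table(
  sbj   = c(rep("s1", 6), rep("s2", 8)),
  grp   = c(rep("g1", 6), rep("g2", 8)),
  cnd   = c(rep("c1", 4), rep("c2", 2), rep("c1", 3), rep("c4", 5)),
  trial = c(rep(1, 4), rep(2, 2), rep(1, 3), rep(2, 5)),
  time  = c(1:4, 1:2, 1:3, 1:5),
  trg   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1),
  cmp   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  dis   = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0)
)

# Sums per subject and trial, same pattern as above
dt_sums <- dt[, lapply(.SD, sum), by = c("sbj", "trial"),
              .SDcols = c("trg", "cmp", "dis")]

# One row per subject/trial holding the factor values to restore
lookup <- unique(dt[, .(sbj, trial, grp, cnd)])

# Join the factors back on by the grouping keys
dt_sums_full <- merge(lookup, dt_sums, by = c("sbj", "trial"))
```

Alternatively, adding "grp" and "cnd" to the by vector should keep them in the result without a separate merge, under the same assumption.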

Upvotes: 3

bramtayl

Reputation: 4024

Here is a way to do it with dplyr and tidyr.

library(dplyr)
library(tidyr)

grouped_df = 
  df %>%
  group_by(sbj, trial)

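totals = 
  grouped_df %>%
  # summarise_each() and funs() are deprecated; across() (dplyr >= 1.0.0) replaces them
  summarise(across(everything(), sum))

proportions = 
  grouped_df %>%
  # the mean of a 0/1 indicator is the proportion of time steps
  summarise(across(everything(), mean))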

You can put these together in long form or wide form.

long = 
  list("Sum" = totals,
       "Prop" = proportions) %>%
  bind_rows(.id = "summarize_function")

wide = 
  long %>%
  gather(variable, value, time:dis) %>%
  unite(new_variable, variable, summarize_function, sep = "") %>%
  spread(new_variable, value)
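
As a possible shortcut to the wide form, across() can apply several functions at once and name the output columns, which gives the trgSum / trgProp style columns from the question in one step. This is a sketch assuming dplyr 1.0.0 or later; summary_wide is an illustrative name, and time is left out because its sum and mean are not meaningful here.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  sbj   = c(rep("s1", 6), rep("s2", 8)),
  trial = c(rep(1, 4), rep(2, 2), rep(1, 3), rep(2, 5)),
  time  = c(1:4, 1:2, 1:3, 1:5),
  trg   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1),
  cmp   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  dis   = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0)
)

summary_wide <- df %>%
  group_by(sbj, trial) %>%
  summarise(
    # apply both functions to each column; names become trgSum, trgProp, ...
    across(c(trg, cmp, dis), list(Sum = sum, Prop = mean),
           .names = "{.col}{.fn}"),
    .groups = "drop"
  )
```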

Upvotes: 2

Dan Lewer

Reputation: 956

I think you can do this with aggregate:

a <- aggregate(cbind(trg, cmp, dis) ~ sbj + trial, data = df, FUN = sum)
x <- aggregate(rep(1, nrow(df)) ~ sbj + trial, data = df, FUN = sum)[,3]
b <- cbind(a[,1:2], a[,3:5]/x)

a
b

The order of the results is slightly different to yours, but that's easy to change if you want.
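
For instance, the rows could be reordered by subject and then trial with order(). A minimal sketch on the question's sample data (a_sorted is just an illustrative name):

```r
# Sample data from the question
df <- data.frame(
  sbj   = c(rep("s1", 6), rep("s2", 8)),
  trial = c(rep(1, 4), rep(2, 2), rep(1, 3), rep(2, 5)),
  time  = c(1:4, 1:2, 1:3, 1:5),
  trg   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1),
  cmp   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  dis   = c(1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0)
)

a <- aggregate(cbind(trg, cmp, dis) ~ sbj + trial, data = df, FUN = sum)

# Sort by subject first, then by trial within subject
a_sorted <- a[order(a$sbj, a$trial), ]
rownames(a_sorted) <- NULL  # reset row labels after reordering
```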

Upvotes: 2
