Reputation: 63

R - keep all columns but only select rows that meet multiple criteria from multiple columns

I have many rows of eye-tracking data (longer fixations and shorter saccades) in my dataframe, each row corresponds to a sample taken by the eye-tracker (see column timestamp). The data proceeds progressively, with counts of fix and sac, their durations and whether each eye movement hits the targets (AOIs). Here’s a sample:

subj    timestamp   fix_indx    sac_indx    eventtype   dur AOI_1   AOI_2   AOI_3
MT  409 1   NA  Fix 83  NA  NA  NA
MT  426 1   NA  Fix 83  NA  NA  NA
MT  443 NA  1   Sac 17  NA  NA  NA
MT  459 NA  1   Sac 17  NA  NA  NA
MT  476 2   NA  Fix 100 0   NA  NA
MT  493 2   NA  Fix 100 0   NA  NA
MT  509 2   NA  Fix 100 0   NA  NA
MT  526 NA  2   Sac 10  NA  NA  NA
MT  543 NA  NA  Unclas  20  NA  NA  NA
MT  559 3   NA  Fix 233 1   NA  NA
MT  576 3   NA  Fix 233 1   NA  NA
MT  593 3   NA  Fix 233 1   NA  NA
MT  609 NA  3   Sac 11  1   NA  NA
MT  626 4   NA  Fix 240 NA  1   NA
MT  643 4   NA  Fix 240 NA  1   NA
MT  643 4   NA  Fix 240 NA  1   NA
MT  659 4   NA  Fix 240 NA  1   NA
MT  676 NA  4   Sac 13  NA  NA  0
MT  693 5   NA  Fix 250 NA  NA  1
MT  709 5   NA  Fix 250 NA  NA  1
MT  726 5   NA  Fix 250 NA  NA  1
MT  743 5   NA  Fix 250 NA  NA  1
MT  809 NA  5   Sac 9   NA  NA  0
MT  826 6   NA  Fix 256 NA  NA  0
MT  842 6   NA  Fix 256 NA  NA  0
MT  859 6   NA  Fix 256 NA  NA  0

I would like to select only those parts of the dataset where in each AOI column the values are 0 (= outside target) and 1 (= inside target), i.e. eliminate the rows where any AOI column has NA (all columns have mostly NAs with some 1s and 0s). I have 2 conditions to meet per AOI column. I have used the subset function: df1<-subset(df, + AOI_1 ==0 | AOI_1 ==1 | + AOI_2 ==0 | AOI_2 ==1)

However, there are 118 different AOI columns in my dataset, so using subset would mean ending up with an extremely long code string. Is there a way of doing this more neatly?

I run AntoniosK example but I get an error: mydfnew$uid <- 1:nrow(mydfnew) # creates a user id colum (uid) > mydfnew %>% + select(uid, starts_with("AOI")) %>% + filter(complete.cases(.)) %>% + select(uid) %>% + inner_join(mydfnew, by="uid") Error in select_(.data, .dots = lazyeval::lazy_dots(...)) : object 'uid' not found.

This is odd as the uid column was created correctly, as confirmed by tail(mydfnew) and mydfnew$uid... anyone knows what is going on here?

Upvotes: 2

Answers (3)

Jaap

Reputation: 83215

If you don't know which columns are the AOI-columns or whether there are columns in between it, it is better to create an index of the names of the AOI columns in combination with complete.cases as follows (in base R):

AOIcols <- names(df)[grepl("^AOI",names(df))]
df[complete.cases(df[,AOIcols]),]

Or with na.omit from the development version of data.table which has a cols argument:

library(data.table) #v1.9.5+
setDT(df)
na.omit(na.omit(df, names(dt)[grepl("^AOI",names(df))]))

In the data presented in the question, there are no rows for which the complete.cases statement is true. Applying the above code on the small dataset as given by @AntoniosK gives the same result:

  id subj AOI_1 AOI_2
1  1    A     0     1
2  2    A     0     0

An alternative dplyr approach:

library(dplyr)
df %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% semi_join(df, by="id")

From what I understand is that the dataset contains eye-tracking data. As such datasets can be quite big, lets try the different methods on a somewhat bigger dataset that mimics the dataset in the question:

# load the need packages
library(dplyr)
library(data.table)

# create a datatable
DT <- data.table(id=seq_len(1e5),
                 subj=rep(LETTERS[1:10],each=1e4),
                 dur=rnorm(1e5, mean=70, sd=10),
                 event=sample(c("fix","sac"), 1e5, TRUE, c(3,1)),
                 matrix(sample(c(0,0,1,1,1,0,0,1,0,NA), 1e7, TRUE, prob=c(10,10,10,10,10,10,10,10,10,1)), ncol=100))

names(DT) <- gsub("V","AOI_",names(DT))

# create an identical dataframe
DF <- copy(DT)
setDF(DF)

# the benchmarks
library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = na.omit(DT, names(DT)[grepl("^AOI",names(DT))]),
          nico = DF[complete.cases(DF[,5:ncol(DF)]),],
          ant1 = DF %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) %>% inner_join(DF, by="id"),
          ant2 = {DF %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) -> IDs
            DF[IDs$id,]
          })

  test elapsed relative
1 jaap   0.467    1.000
4 ant2   1.093    2.340
3 ant1   1.191    2.550
2 nico   1.430    3.062

Upvotes: 1

AntoniosK

Reputation: 16121

Assuming that your "AOI" colums can have only 0, 1, NA as values you can try something like the following (simple) example. The only thing you need to do (in advance) is to create an id for your rows and have it as a column:

library(dplyr)

# example dataset
dt = data.frame(id = 1:5,
                subj = "A",
                AOI_1 = c(0,0,1,NA,NA),
                AOI_2 = c(1,0,NA,0,1))

dt

#   id subj AOI_1 AOI_2
# 1  1    A     0     1
# 2  2    A     0     0
# 3  3    A     1    NA
# 4  4    A    NA     0
# 5  5    A    NA     1


dt %>% 
  select(id, starts_with("AOI")) %>%  # keep only id and columns that start with "AOI"
  filter(complete.cases(.)) %>%       # keep only rows that don't have NAs
  select(id) %>%                      # select the ids of those rows
  inner_join(dt, by="id")             # join back info for those rows only

#   id subj AOI_1 AOI_2
# 1  1    A     0     1
# 2  2    A     0     0

Run the example step by step to see how it works and then do any modifications needed. You can create the row id within the process if you prefer.

The inner join at the end is not really needed, as you can just simply spot the rows without NAs in the AOI columns and just select those, instead of joining back. It will be faster if you have to deal with millions of rows:

dt %>% 
  select(id, starts_with("AOI")) %>% 
  filter(complete.cases(.)) %>% 
  select(id) -> IDs

dt[IDs$id,]

@Jaap the benchmarks seem to change (a bit) every time I run them. I'll post some examples based on your dataset:

> # the benchmarks
>  system.time(na.omit(dt, names(dt)[grepl("^AOI",names(dt))]))
   user  system elapsed 
      0       0       0 
> 
>  system.time(df[complete.cases(df[,3:ncol(df)]),])
   user  system elapsed 
   0.02    0.00    0.02 
>  
>  system.time(df %>% select(id, subj, starts_with("AOI")) %>% filter(complete.cases(.)) %>% semi_join(df, by="id"))
   user  system elapsed 
   0.02    0.00    0.02 
> 
>  system.time(df %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) %>% inner_join(df, by="id"))
   user  system elapsed 
   0.02    0.00    0.01

    >  system.time(na.omit(dt, names(dt)[grepl("^AOI",names(dt))]))
   user  system elapsed 
   0.01    0.00    0.02 
> 
>  system.time(df[complete.cases(df[,3:ncol(df)]),])
   user  system elapsed 
   0.02    0.00    0.01 
>  
>  system.time(df %>% select(id, subj, starts_with("AOI")) %>% filter(complete.cases(.)) %>% semi_join(df, by="id"))
   user  system elapsed 
   0.02    0.00    0.01 
> 
>  system.time(df %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) %>% inner_join(df, by="id"))
   user  system elapsed 
   0.03    0.00    0.04

Upvotes: 0

nico

Reputation: 51640

Assuming all the AOI columns are at the end, you can select them using something like:

my.data[,7:ncol(my.data)]

Now you can use complete.cases to select those that don't have NAs.

complete.cases(my.data[,7:ncol(my.data)])

Upvotes: 0

R - keep all columns but only select rows that meet multiple criteria from multiple columns

Answers (3)

Related Questions