Reputation: 63
I have many rows of eye-tracking data (longer fixations and shorter saccades) in my dataframe, each row corresponds to a sample taken by the eye-tracker (see column timestamp
). The data proceeds progressively, with counts of fix and sac, their durations and whether each eye movement hits the targets (AOIs). Here’s a sample:
subj timestamp fix_indx sac_indx eventtype dur AOI_1 AOI_2 AOI_3
MT 409 1 NA Fix 83 NA NA NA
MT 426 1 NA Fix 83 NA NA NA
MT 443 NA 1 Sac 17 NA NA NA
MT 459 NA 1 Sac 17 NA NA NA
MT 476 2 NA Fix 100 0 NA NA
MT 493 2 NA Fix 100 0 NA NA
MT 509 2 NA Fix 100 0 NA NA
MT 526 NA 2 Sac 10 NA NA NA
MT 543 NA NA Unclas 20 NA NA NA
MT 559 3 NA Fix 233 1 NA NA
MT 576 3 NA Fix 233 1 NA NA
MT 593 3 NA Fix 233 1 NA NA
MT 609 NA 3 Sac 11 1 NA NA
MT 626 4 NA Fix 240 NA 1 NA
MT 643 4 NA Fix 240 NA 1 NA
MT 643 4 NA Fix 240 NA 1 NA
MT 659 4 NA Fix 240 NA 1 NA
MT 676 NA 4 Sac 13 NA NA 0
MT 693 5 NA Fix 250 NA NA 1
MT 709 5 NA Fix 250 NA NA 1
MT 726 5 NA Fix 250 NA NA 1
MT 743 5 NA Fix 250 NA NA 1
MT 809 NA 5 Sac 9 NA NA 0
MT 826 6 NA Fix 256 NA NA 0
MT 842 6 NA Fix 256 NA NA 0
MT 859 6 NA Fix 256 NA NA 0
I would like to select only those parts of the dataset where in each AOI column the values are 0 (= outside target) and 1 (= inside target), i.e. eliminate the rows where any AOI column has NA
(all columns have mostly NAs with some 1s and 0s). I have 2 conditions to meet per AOI column. I have used the subset
function:
df1<-subset(df,
+ AOI_1 ==0 | AOI_1 ==1 |
+ AOI_2 ==0 | AOI_2 ==1)
However, there are 118 different AOI columns in my dataset, so using subset
would mean ending up with an extremely long code string. Is there a way of doing this more neatly?
I run AntoniosK example but I get an error:
mydfnew$uid <- 1:nrow(mydfnew)
# creates a user id colum (uid
)
> mydfnew %>%
+ select(uid, starts_with("AOI")) %>%
+ filter(complete.cases(.)) %>%
+ select(uid) %>%
+ inner_join(mydfnew, by="uid")
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) :
object 'uid' not found
.
This is odd as the uid
column was created correctly, as confirmed by tail(mydfnew)
and mydfnew$uid
... anyone knows what is going on here?
Upvotes: 2
Views: 2633
Reputation: 83215
If you don't know which columns are the AOI-columns or whether there are columns in between it, it is better to create an index of the names of the AOI columns in combination with complete.cases
as follows (in base R):
AOIcols <- names(df)[grepl("^AOI",names(df))]
df[complete.cases(df[,AOIcols]),]
Or with na.omit
from the development version of data.table
which has a cols
argument:
library(data.table) #v1.9.5+
setDT(df)
na.omit(na.omit(df, names(dt)[grepl("^AOI",names(df))]))
In the data presented in the question, there are no rows for which the complete.cases
statement is true. Applying the above code on the small dataset as given by @AntoniosK gives the same result:
id subj AOI_1 AOI_2
1 1 A 0 1
2 2 A 0 0
An alternative dplyr
approach:
library(dplyr)
df %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% semi_join(df, by="id")
From what I understand is that the dataset contains eye-tracking data. As such datasets can be quite big, lets try the different methods on a somewhat bigger dataset that mimics the dataset in the question:
# load the need packages
library(dplyr)
library(data.table)
# create a datatable
DT <- data.table(id=seq_len(1e5),
subj=rep(LETTERS[1:10],each=1e4),
dur=rnorm(1e5, mean=70, sd=10),
event=sample(c("fix","sac"), 1e5, TRUE, c(3,1)),
matrix(sample(c(0,0,1,1,1,0,0,1,0,NA), 1e7, TRUE, prob=c(10,10,10,10,10,10,10,10,10,1)), ncol=100))
names(DT) <- gsub("V","AOI_",names(DT))
# create an identical dataframe
DF <- copy(DT)
setDF(DF)
# the benchmarks
library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
jaap = na.omit(DT, names(DT)[grepl("^AOI",names(DT))]),
nico = DF[complete.cases(DF[,5:ncol(DF)]),],
ant1 = DF %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) %>% inner_join(DF, by="id"),
ant2 = {DF %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) -> IDs
DF[IDs$id,]
})
test elapsed relative
1 jaap 0.467 1.000
4 ant2 1.093 2.340
3 ant1 1.191 2.550
2 nico 1.430 3.062
Upvotes: 1
Reputation: 16121
Assuming that your "AOI" colums can have only 0, 1, NA as values you can try something like the following (simple) example. The only thing you need to do (in advance) is to create an id for your rows and have it as a column:
library(dplyr)
# example dataset
dt = data.frame(id = 1:5,
subj = "A",
AOI_1 = c(0,0,1,NA,NA),
AOI_2 = c(1,0,NA,0,1))
dt
# id subj AOI_1 AOI_2
# 1 1 A 0 1
# 2 2 A 0 0
# 3 3 A 1 NA
# 4 4 A NA 0
# 5 5 A NA 1
dt %>%
select(id, starts_with("AOI")) %>% # keep only id and columns that start with "AOI"
filter(complete.cases(.)) %>% # keep only rows that don't have NAs
select(id) %>% # select the ids of those rows
inner_join(dt, by="id") # join back info for those rows only
# id subj AOI_1 AOI_2
# 1 1 A 0 1
# 2 2 A 0 0
Run the example step by step to see how it works and then do any modifications needed. You can create the row id within the process if you prefer.
The inner join at the end is not really needed, as you can just simply spot the rows without NAs in the AOI columns and just select those, instead of joining back. It will be faster if you have to deal with millions of rows:
dt %>%
select(id, starts_with("AOI")) %>%
filter(complete.cases(.)) %>%
select(id) -> IDs
dt[IDs$id,]
@Jaap the benchmarks seem to change (a bit) every time I run them. I'll post some examples based on your dataset:
A)
> # the benchmarks
> system.time(na.omit(dt, names(dt)[grepl("^AOI",names(dt))]))
user system elapsed
0 0 0
>
> system.time(df[complete.cases(df[,3:ncol(df)]),])
user system elapsed
0.02 0.00 0.02
>
> system.time(df %>% select(id, subj, starts_with("AOI")) %>% filter(complete.cases(.)) %>% semi_join(df, by="id"))
user system elapsed
0.02 0.00 0.02
>
> system.time(df %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) %>% inner_join(df, by="id"))
user system elapsed
0.02 0.00 0.01
B)
> system.time(na.omit(dt, names(dt)[grepl("^AOI",names(dt))]))
user system elapsed
0.01 0.00 0.02
>
> system.time(df[complete.cases(df[,3:ncol(df)]),])
user system elapsed
0.02 0.00 0.01
>
> system.time(df %>% select(id, subj, starts_with("AOI")) %>% filter(complete.cases(.)) %>% semi_join(df, by="id"))
user system elapsed
0.02 0.00 0.01
>
> system.time(df %>% select(id, starts_with("AOI")) %>% filter(complete.cases(.)) %>% select(id) %>% inner_join(df, by="id"))
user system elapsed
0.03 0.00 0.04
Upvotes: 0
Reputation: 51640
Assuming all the AOI columns are at the end, you can select them using something like:
my.data[,7:ncol(my.data)]
Now you can use complete.cases
to select those that don't have NAs.
complete.cases(my.data[,7:ncol(my.data)])
Upvotes: 0