Reputation: 17611
I am trying to modify one list based on information contained in another list. The essence of the problem is that one list, list1, contains data frames with two columns each: column 1 (time) holds time instants, and column 2 (score) is filled with 0s. The other list, list2, contains data frames whose two columns hold paired time instants (e.g. 0.12 and 0.125, or 1.54 and 1.57), i.e. the start and end points of a time window. Each data frame in list1 has a corresponding (i.e. same-name) data frame in list2.
The goal is to figure out which time instants (i.e. from the time column) in each data frame in list1 fall inside any of the time windows in the associated data frame in list2, by modifying the score column of the data frames in list1 (the column of all zeros) whenever the associated time instant lies between at least one of the paired times from the corresponding (i.e. same-name) data frame in list2. The end result is essentially list1, with the score column distinguishing time instants that do not fall in any window (still 0) from those that do.
Here is some example data:
set.seed(1)
# list1: ten 100-row data frames of time instants, with score initialised to 0
list1 <- split(d <- replicate(10, 1:100 + rnorm(100, 0, 0.1)),
               ceiling(seq_along(d) / 100))
list1 <- lapply(list1, function(x) data.frame(cbind(time = x, score = 0)))
names(list1) <- letters[1:10]
# list2: ten data frames of window start/end times (a random 1-20 windows each)
list2 <- replicate(10, sample(1:100, sample(1:20), replace=FALSE))
list2 <- lapply(list2, function(x)
  data.frame(cbind(x, x + sample(runif(100, min=0.1, max=3), length(x)))))
names(list2) <- letters[1:10]
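Just to make the layout concrete (output omitted; this is only a quick structure check, not part of the solution), each pair of elements can be inspected like this:
# each element of list1: a data frame with columns time and score (all 0)
str(list1$a)
# each element of list2: a data frame with window start (x) and end (V2) columns
str(list2$a)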
I was able to cobble together a solution or two that work for small examples, but when I try it on my much larger list (i.e. several million time instants in the real list1), I run into memory errors.
First, I made a function to do what I want given two appropriate data frames:
testfxn1 <- function(df1, df2)
{
  # for each window (row of df2), mark the rows of df1 whose time falls inside it
  df1 <- lapply(1:dim(df2)[1], function(x)
  {
    df1[which(df1[1] > df2[x, 1] & df1[1] < df2[x, 2]), 2] <- 1
    return(df1)
  })
  # combine: keep the time column and sum the score columns across all copies
  return(cbind(df1[[1]][1],
               score = rowSums(do.call(cbind, lapply(df1, '[[', 2)))))
}
Then, I use sapply to apply the function to the whole list:
sapply(names(list1), simplify=FALSE, function(x) return(testfxn1(list1[[x]], list2[[x]])))
It does what I want (i.e. leaves 0s where the data do not fall within a relevant time window), but in my real data there are many data frames with 250,000 - 750,000 time instants in list1, and I am unable to allocate enough memory to complete the task.
Any thoughts on how to accomplish this more efficiently? Surely, the first lapply call in my testfxn1 must be a big part of the problem.
Upvotes: 1
Views: 134
Reputation: 17611
@HillarySanders's answer gave me an idea. For my purposes, the following yields the same relevant information as testfxn1, though a post-processing step is needed to join the output with the original list1 time instant data.
testfxn2 <- function(df1, df2)
{
  # for each time instant, check whether it falls inside any window in df2
  sapply(df1$time, function(g)
  {
    any(g > df2[, 1] & g < df2[, 2])
  })
}
Then sapply it:
sapply(names(list1), simplify=FALSE, function(x) testfxn2(list1[[x]], list2[[x]]))
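The post-processing step mentioned above could look something like this (just a sketch; scores and list1_scored are placeholder names of my own):
# join the logical output back onto the original time instants,
# coercing TRUE/FALSE to 1/0 for the score column
scores <- sapply(names(list1), simplify=FALSE,
                 function(x) testfxn2(list1[[x]], list2[[x]]))
list1_scored <- Map(function(d, s) data.frame(time = d$time, score = as.numeric(s)),
                    list1, scores)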
Upvotes: 1
Reputation: 887291
Not sure if this will be faster.
res <- setNames(lapply(names(list1), function(x) {
  x1 <- list2[[x]]
  x2 <- list1[[x]][, 1]
  # expand the window start/end times into matrices with one row per time instant
  x3 <- t(replicate(length(x2), x1[, 1]))
  x4 <- t(replicate(length(x2), x1[, 2]))
  data.frame(time = x2, score = rowSums(x2 > x3 & x2 < x4))
}), names(list1))
s1 <- sapply(names(list1), simplify=FALSE, function(x) return(testfxn1(list1[[x]], list2[[x]])))
identical(res, s1)
#[1] TRUE
Another way would be:
library(data.table)
x1 <- rbindlist(list1)
x2 <- rbindlist(list2)
#slower
# compare every time instant against every window start (x) and end (V2)
s3 <- Vectorize(function(x) x1[, time] > x)(x2[, x]) &
      Vectorize(function(y) x1[, time] < y)(x2[, V2])
# map the columns of s3 back to the list2 element each window came from
indx <- rep(names(list2), sapply(list2, dim)[1, ])
# each list1 element contributes 100 rows to x1 (as in the example data)
indx2 <- seq(1, nrow(x1), by = 100)
lst1 <- split(seq_len(ncol(s3)), indx)
res1 <- setNames(lapply(seq_along(indx2), function(i)
  data.frame(time = list1[[i]][, 1],
             score = rowSums(s3[indx2[i]:(indx2[i] + 99), lst1[[i]]]))),
  names(list1))
identical(res, res1)
#[1] TRUE
Upvotes: 1
Reputation: 75205
There may be better approaches yet, but a significant part of the memory crunch surely comes from the many copies of df1 (each with a few ones rather than zeros in the score column, where appropriate). An alternative is to use a single copy of df1 and increment the corresponding entry whenever a time instant falls inside a window. Furthermore, this approach avoids having to recreate df1 from the summed copies at the end.
In other words:
# alternative to the original testfxn1() function
# The idea is to increment the value in the qualifying rows in-situ rather
# than producing as many copies of df1 as there are rows in df2 and having to
# sum these up at the end.
testfxn1 <- function(df1, df2)
{
  for (x in 1:nrow(df2))  # I find this more explicit than 1:dim(df2)[1]
  {
    # Get "list" of qualifying rows
    selectRows <- which(df1[1] > df2[x, 1] & df1[1] < df2[x, 2])
    # Increment the corresponding rows' score value
    if (length(selectRows) > 0) {
      df1[selectRows, 2] <- df1[selectRows, 2] + 1
      # or more explicitly...
      # df1[selectRows, ]$score <- df1[selectRows, ]$score + 1
    }
  }
  df1
}
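Since the signature is unchanged, this is a drop-in replacement and can be invoked with the same sapply call as in the question (a sketch; s4 is just a placeholder name):
# apply the in-place version over the paired lists, exactly as before
s4 <- sapply(names(list1), simplify=FALSE,
             function(x) testfxn1(list1[[x]], list2[[x]]))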
Upvotes: 1
Reputation: 6047
This might be a little faster, if I'm understanding the problem correctly. You can at least remove one set of loops through vectorization.
windows = rbind(list2[[1]], list2[[2]])
# for each time t, look at all windows and see if t falls inside at least one window:
list1[[1]]$score = sapply(list1[[1]]$time, FUN=function(t){
  any(t > windows[,1] & t < windows[,2])
})
# same thing for the second dataframe in list1
# TRUE = in a window, FALSE = not in a window. Use as.numeric() to coerce to 0's and 1's.
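To cover every data frame rather than just the first, the same vectorized check could be wrapped over the paired elements (a sketch only; scored is a placeholder name, and the pairing follows the same-name convention from the question):
# apply the vectorized window check to each list1/list2 pair by name
scored <- Map(function(d, w) {
  d$score <- as.numeric(sapply(d$time, function(t) any(t > w[, 1] & t < w[, 2])))
  d
}, list1, list2[names(list1)])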
Upvotes: 1