Reputation: 17611
I am trying to modify one list based on information contained in another list. The essence of the problem is that one list, list1, contains data frames with two columns each: column 1 (time) holds time instants, and column 2 (score) is filled with 0s. The other list, list2, contains data frames whose two columns hold paired time instants (e.g. 0.12 and 0.125, or 1.54 and 1.57), i.e. the start and end points of a time window. Each data frame in list1 has a corresponding (i.e. same-name) data frame in list2.
The goal is to figure out which time instants (i.e. from the time column) in each data frame in list1 fall inside any of the time windows in the associated data frame in list2, by modifying the score column of the data frames in list1 (the column of all zeros) whenever the associated time instant lies between at least one of the paired times from the corresponding (i.e. same-name) data frame in list2. The end result is essentially list1, with the score column distinguishing time instants that do not fall in any window (still 0) from those that do.
Here is some example data:
set.seed(1)
# list1: ten 100-row data frames of time instants, with score initialised to 0
list1 <- split(d <- replicate(10, 1:100 + rnorm(100, 0, 0.1)),
               ceiling(seq_along(d) / 100))
list1 <- lapply(list1, function(x) data.frame(cbind(time = x, score = 0)))
names(list1) <- letters[1:10]
# list2: ten data frames of window start/end times (a random 1-20 windows each)
list2 <- replicate(10, sample(1:100, sample(1:20), replace=FALSE))
list2 <- lapply(list2, function(x)
  data.frame(cbind(x, x + sample(runif(100, min=0.1, max=3), length(x)))))
names(list2) <- letters[1:10]
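Just to make the layout concrete (output omitted; this is only a quick structure check, not part of the solution), each pair of elements can be inspected like this:
# each element of list1: a data frame with columns time and score (all 0)
str(list1$a)
# each element of list2: a data frame with window start (x) and end (V2) columns
str(list2$a)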
I was able to cobble together a solution or two that work for small examples, but when I try it on my much larger list (i.e. several million time instants in the real list1), I run into memory errors.
First, I made a function to do what I want given two appropriate data frames:
testfxn1 <- function(df1, df2)
{
  # for each window (row of df2), mark the rows of df1 whose time falls inside it
  df1 <- lapply(1:dim(df2)[1], function(x)
  {
    df1[which(df1[1] > df2[x, 1] & df1[1] < df2[x, 2]), 2] <- 1
    return(df1)
  })
  # combine: keep the time column and sum the score columns across all copies
  return(cbind(df1[[1]][1],
               score = rowSums(do.call(cbind, lapply(df1, '[[', 2)))))
}
Then, I use sapply to apply the function to the whole list:
sapply(names(list1), simplify=FALSE, function(x) return(testfxn1(list1[[x]], list2[[x]])))
It does what I want (i.e. leaves 0s where the data do not fall within a relevant time window), but in my real data there are many data frames with 250,000 - 750,000 time instants in list1, and I am unable to allocate enough memory to complete the task.
Any thoughts on how to accomplish this more efficiently? Surely, the first lapply call in my testfxn1 must be a big part of the problem.
Upvotes: 1
Views: 134
Reputation: 17611
@HillarySanders's answer gave me an idea. For my purposes, the following yields the same relevant information as testfxn1, though a post-processing step is needed to join the output with the original list1 time instant data.
testfxn2 <- function(df1, df2)
{
  # for each time instant, check whether it falls inside any window in df2
  sapply(df1$time, function(g)
  {
    any(g > df2[, 1] & g < df2[, 2])
  })
}
Then sapply it:
sapply(names(list1), simplify=FALSE, function(x) testfxn2(list1[[x]], list2[[x]]))
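The post-processing step mentioned above could look something like this (just a sketch; scores and list1_scored are placeholder names of my own):
# join the logical output back onto the original time instants,
# coercing TRUE/FALSE to 1/0 for the score column
scores <- sapply(names(list1), simplify=FALSE,
                 function(x) testfxn2(list1[[x]], list2[[x]]))
list1_scored <- Map(function(d, s) data.frame(time = d$time, score = as.numeric(s)),
                    list1, scores)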
Upvotes: 1
Reputation: 887291
Not sure if this will be faster.
res <- setNames(lapply(names(list1), function(x) {
  x1 <- list2[[x]]
  x2 <- list1[[x]][, 1]
  # expand the window start/end times into matrices with one row per time instant
  x3 <- t(replicate(length(x2), x1[, 1]))
  x4 <- t(replicate(length(x2), x1[, 2]))
  data.frame(time = x2, score = rowSums(x2 > x3 & x2 < x4))
}), names(list1))
s1 <- sapply(names(list1), simplify=FALSE, function(x) return(testfxn1(list1[[x]], list2[[x]])))
identical(res, s1)
#[1] TRUE
Another way would be:
library(data.table)
x1 <- rbindlist(list1)
x2 <- rbindlist(list2)
#slower
# compare every time instant against every window start (x) and end (V2)
s3 <- Vectorize(function(x) x1[, time] > x)(x2[, x]) &
      Vectorize(function(y) x1[, time] < y)(x2[, V2])
# map the columns of s3 back to the list2 element each window came from
indx <- rep(names(list2), sapply(list2, dim)[1, ])
# each list1 element contributes 100 rows to x1 (as in the example data)
indx2 <- seq(1, nrow(x1), by = 100)
lst1 <- split(seq_len(ncol(s3)), indx)
res1 <- setNames(lapply(seq_along(indx2), function(i)
  data.frame(time = list1[[i]][, 1],
             score = rowSums(s3[indx2[i]:(indx2[i] + 99), lst1[[i]]]))),
  names(list1))
identical(res, res1)
#[1] TRUE
Upvotes: 1
Reputation: 75205
There may be better approaches yet, but a significant part of the memory crunch surely comes from the many copies of df1 (each with a few ones rather than zeros in the score column, where appropriate). An alternative is to use a single copy of df1 and increment the corresponding entry whenever a time instant falls inside a window. Furthermore, this approach avoids having to recreate df1 from the summed copies at the end.
In other words:
# alternative to the original testfxn1() function
# The idea is to increment the value in the qualifying rows in-situ rather
# than producing as many copies of df1 as there are rows in df2 and having to
# sum these up at the end.
testfxn1 <- function(df1, df2)
{
  for (x in 1:nrow(df2))  # I find this more explicit than 1:dim(df2)[1]
  {
    # Get "list" of qualifying rows
    selectRows <- which(df1[1] > df2[x, 1] & df1[1] < df2[x, 2])
    # Increment the corresponding rows' score value
    if (length(selectRows) > 0) {
      df1[selectRows, 2] <- df1[selectRows, 2] + 1
      # or more explicitly...
      # df1[selectRows, ]$score <- df1[selectRows, ]$score + 1
    }
  }
  df1
}
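Since the signature is unchanged, this is a drop-in replacement and can be invoked with the same sapply call as in the question (a sketch; s4 is just a placeholder name):
# apply the in-place version over the paired lists, exactly as before
s4 <- sapply(names(list1), simplify=FALSE,
             function(x) testfxn1(list1[[x]], list2[[x]]))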
Upvotes: 1
Reputation: 6047
This might be a little faster, if I'm understanding the problem correctly. You can at least remove one set of loops through vectorization.
windows = rbind(list2[[1]], list2[[2]])
# for each time t, look at all windows and see if t falls inside at least one window:
list1[[1]]$score = sapply(list1[[1]]$time, FUN=function(t){
  any(t > windows[,1] & t < windows[,2])
})
# same thing for the second dataframe in list1
# TRUE = in a window, FALSE = not in a window. Use as.numeric() to coerce to 0's and 1's.
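To cover every data frame rather than just the first, the same vectorized check could be wrapped over the paired elements (a sketch only; scored is a placeholder name, and the pairing follows the same-name convention from the question):
# apply the vectorized window check to each list1/list2 pair by name
scored <- Map(function(d, w) {
  d$score <- as.numeric(sapply(d$time, function(t) any(t > w[, 1] & t < w[, 2])))
  d
}, list1, list2[names(list1)])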
Upvotes: 1