I planned to simplify this question, but I decided to post my original code for clarity. I'm trying to create a list score_loc that has 1874 elements and pulls data from three different data sets: loc_pre, loc_locate, and csv_t. To create this list I'm using a for loop to assign a data frame to each list element. This is extremely slow because the data is very large, and it is also giving me errors.
Reproducible Data
Shortened csv_t.csv (contains the first 20000 rows)
As for loc_locate, it's a little difficult to show a reproducible example for a list of data frames.
Some previously determined data:
head(loc_pre) # 1874 rows
# start end
# 1 4844 4852
# 2 5954 5962
# 3 7896 7904
# 4 12301 12309
# 5 18553 18561
# 6 18670 18678
loc_locate # a list of varying lengths of dataframes; 1874 list elements
# [[1]]
# start end
# [1,] 6 6
#
# [[2]]
# start end
# [1,] 1 1
# [2,] 6 6
# [3,] 9 9
#
# [[3]]
# start end
# [1,] 6 6
# [2,] 8 8
head(csv_t) # 4524203 rows, tpl column values are consecutively increasing by 1
# tpl score
# 1: 3239 6
# 2: 3240 6
# 3: 3241 7
# 4: 3242 13
# 5: 3243 0
# 6: 3244 6
Desired output:
You can see that the row number of loc_pre corresponds to the list element number of loc_locate. Each element of loc_locate gives positions relative to the corresponding start position in loc_pre. For example, taking the first element of loc_locate and the first row of loc_pre, you are looking for the 6th position in 4844, 4845, 4846, 4847, 4848, 4849, 4850, 4851, 4852, which is 4849.
Following this logic, I want to create a new list score_loc of 1874 elements that shows the start, end, and score of those desired positions for each row of loc_pre. The score column comes from csv_t.
score_loc
# [[1]]
# start end score
# [1,] 6 6 10 # score corresponding to position (4844 + 6 - 1)
#
# [[2]]
# start end score
# [1,] 1 1 1 # score corresponding to position (5954 + 1 - 1)
# [2,] 6 6 2 # score corresponding to position (5954 + 6 - 1)
# [3,] 9 9 8 # score corresponding to position (5954 + 9 - 1)
#
# [[3]]
# start end score
# [1,] 6 6 19 # score corresponding to position (7896 + 6 - 1)
# [2,] 8 8 11 # score corresponding to position (7896 + 8 - 1)
My Code
As I mentioned before, I'm using a for loop to try to accomplish this, but this method is taking way too long. I hope that you can get a clearer idea of what I am trying to accomplish by looking at my code.
score_loc <- list()
for (w in 1:nrow(loc_pre)) {
  vectornom <- loc_pre[w, 1] + loc_locate[[w]][, "start"] - 1
  score_loc[[w]] <- data.frame(csv_t[csv_t$tpl %in% vectornom, ][, 4, with = FALSE]) # takes a long time for some reason
}
One way to do this is with the mapply function:
# Expand each start:end range into its full sequence of positions
# (Map always returns a list, even when every range has the same length)
preList <- Map(seq, loc_pre$start, loc_pre$end)
# Function to build the tpl column for one element of loc_locate
posFun <- function(Seq, Loc) {
  cbind(Loc, tpl = apply(Loc, 1, function(X, S) S[X[1]:X[2]], S = Seq))
}
# Apply, combine and merge (SIMPLIFY = FALSE keeps the result as a list)
mOutput <- mapply(posFun, preList, loc_locate, SIMPLIFY = FALSE)
mIndex <- rep(seq_along(mOutput), sapply(mOutput, nrow)) # Not sure if you need this, but have included for now
combineData <- data.frame(Index = mIndex, do.call("rbind", mOutput))
merge(combineData, csv_t, all.x = TRUE)
Looking at the data samples, it seems we can simplify this to:
posFun <- function(Seq, Loc) cbind(Loc, tpl = Seq + Loc[,1] - 1)
mOutput <- mapply(posFun, loc_pre$start, loc_locate, SIMPLIFY = FALSE) # keep a list of matrices
merge(do.call("rbind", mOutput), csv_t, all.x = TRUE)
# tpl start end score
# 1 4849 6 6 6
# 2 5954 1 1 4
# 3 5959 6 6 7
# 4 5962 9 9 6
# 5 7901 6 6 2
# 6 7903 8 8 1
Note: I've randomly generated my scores in this example
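If a list shaped like the desired score_loc is needed rather than one merged table, the same lookup can be done per element. The following is only a sketch under stated assumptions: the three objects are small toy stand-ins built from the samples in the question, the scores are made up, and match() is used in place of merge() so that each element keeps its original row order.

```r
# Toy stand-ins for the question's objects (scores are hypothetical)
loc_pre <- data.frame(start = c(4844, 5954, 7896), end = c(4852, 5962, 7904))
loc_locate <- list(
  cbind(start = 6, end = 6),
  cbind(start = c(1, 6, 9), end = c(1, 6, 9)),
  cbind(start = c(6, 8), end = c(6, 8))
)
csv_t <- data.frame(tpl = 3239:8000)
csv_t$score <- csv_t$tpl %% 20   # made-up scores, for illustration only

# For each row of loc_pre, convert relative positions to tpl values,
# then look up the matching score; match() preserves the order of Loc
score_loc <- Map(function(Start, Loc) {
  tpl <- Start + Loc[, "start"] - 1
  cbind(Loc, score = csv_t$score[match(tpl, csv_t$tpl)])
}, loc_pre$start, loc_locate)

score_loc[[2]]
```

Because the question states that tpl increases by 1 from row to row, tpl - csv_t$tpl[1] + 1 should give the same row index without match(), assuming there really are no gaps in the full data.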