alki

Reputation: 3574

R creating new list from other lists/dataframes

I planned to simplify this question, but I decided to post my original code for clarity. I'm trying to create a list, score_loc, that has 1874 elements and pulls data from three different data sets: loc_pre, loc_locate, and csv_t. To build this list I'm using a for loop that assigns a data frame to each list element. This is extremely slow because the data is very large, and it is also giving me errors.

Reproducible Data

Shortened csv_t.csv contains first 20000 rows

csv_t.csv dropbox link

loc_pre.csv dropbox link

As for loc_locate, it's a little difficult to provide a reproducible example for a list of data frames.

Some previously determined data:

head(loc_pre)  # 1874 rows 
#   start   end
# 1  4844  4852
# 2  5954  5962
# 3  7896  7904
# 4 12301 12309
# 5 18553 18561
# 6 18670 18678

loc_locate  # a list of varying lengths of dataframes; 1874 list elements
# [[1]]
#      start end
# [1,]     6   6
#
# [[2]]
#      start end
# [1,]     1   1
# [2,]     6   6
# [3,]     9   9
#
# [[3]]
#      start end
# [1,]     6   6
# [2,]     8   8

head(csv_t)  # 4524203 rows, tpl column values are consecutively increasing by 1
#     tpl score 
# 1: 3239     6 
# 2: 3240     6 
# 3: 3241     7 
# 4: 3242    13 
# 5: 3243     0 
# 6: 3244     6 

Desired output:

The row number of loc_pre corresponds to the list element number of loc_locate. Each entry in loc_locate gives a position relative to the corresponding start position in loc_pre. For example, if you take the first element of loc_locate and the first row of loc_pre, you are looking for the 6th position in 4844, 4845, 4846, 4847, 4848, 4849, 4850, 4851, 4852. In this case, the desired position is 4849.
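A minimal sketch of that arithmetic, using only the first row of loc_pre and the first element of loc_locate from the samples above (the variable names are just for illustration):

```r
# Values copied from the first row of loc_pre and the first element of loc_locate
pre_start <- 4844
pre_end   <- 4852
rel_pos   <- 6

# Full run of positions covered by this row of loc_pre
positions <- pre_start:pre_end

# The 6th position in that run, computed two equivalent ways
abs_pos_by_index <- positions[rel_pos]
abs_pos_by_sum   <- pre_start + rel_pos - 1

abs_pos_by_index  # 4849
abs_pos_by_sum    # 4849
```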

Following this line of logic, I want to create a new list score_loc of 1874 list elements that would show me the start, end, and score of those desired positions for each separate row of loc_pre. The score column would be from csv_t.

score_loc
# [[1]]
#      start end score
# [1,]     6   6    10   # score corresponding to position (4844 + 6 - 1)
#
# [[2]]
#      start end score
# [1,]     1   1     1   # score corresponding to position (5954 + 1 - 1)
# [2,]     6   6     2   # score corresponding to position (5954 + 6 - 1)
# [3,]     9   9     8   # score corresponding to position (5954 + 9 - 1)
#
# [[3]]
#      start end score
# [1,]     6   6    19   # score corresponding to position (7896 + 6 - 1)
# [2,]     8   8    11   # score corresponding to position (7896 + 8 - 1)

My Code

As I mentioned before, I'm using a for loop to try to accomplish this, but this method is taking way too long. I hope that you can get a clearer idea of what I am trying to accomplish by looking at my code.

score_loc <- list()
for(w in 1:nrow(loc_pre)){
   vectornom <- loc_pre[w, 1] + loc_locate[[w]][,"start"] - 1
   score_loc[[w]] <- data.frame(csv_t[csv_t$tpl %in% vectornom,][, 4, with=F]) # takes a long time for some reason
}

Upvotes: 0

Views: 104

Answers (1)

RichAtMango

Reputation: 386

One way to do it is to use the mapply function:

# Expand the sequences (Map always returns a list, even when all ranges have the same length)
preList <- Map(seq, loc_pre$start, loc_pre$end)

# Function to build tpl datasets
posFun <- function(Seq, Loc) {
  cbind(Loc, tpl = apply(Loc, 1, function(X, S) S[X[1]:X[2]], S = Seq))
}

# Apply, combine and merge
mOutput <- mapply(posFun, preList, loc_locate, SIMPLIFY = FALSE) # keep a list of matrices
mIndex <- rep(seq_along(mOutput), sapply(mOutput, nrow)) # Not sure if you need this, but have included for now
combineData <- data.frame(Index = mIndex, do.call("rbind", mOutput))
merge(combineData, csv_t, all.x = TRUE)
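If the result is needed as a list with one element per row of loc_pre, the Index column can be used to split the merged table again. The following is only a sketch on made-up toy data shaped like combineData and csv_t; the values are hypothetical and not taken from the real files:

```r
# Toy stand-ins (hypothetical values) shaped like combineData and csv_t
combineData <- data.frame(
  Index = c(1, 2, 2, 2),
  start = c(6, 1, 6, 9),
  end   = c(6, 1, 6, 9),
  tpl   = c(15, 20, 25, 28)
)
csv_t <- data.frame(tpl = 5:40, score = 105:140)

# Join on the shared tpl column, keeping every row of combineData
merged <- merge(combineData, csv_t, all.x = TRUE)

# One data frame per loc_pre row, keyed by Index
score_loc <- split(merged[, c("start", "end", "score")], merged$Index)
score_loc[["2"]]
```

Note that an element of loc_locate with zero rows would not appear in the split unless Index is first converted to a factor that lists every row number as a level.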

Looking at the data samples, start and end are always equal in loc_locate, so it seems we can simplify this to:

posFun <- function(Seq, Loc) cbind(Loc, tpl = Seq + Loc[,1] - 1)
mOutput <- mapply(posFun, loc_pre$start, loc_locate, SIMPLIFY = FALSE)
merge(do.call("rbind", mOutput), csv_t, all.x = TRUE)
#    tpl start end score
# 1 4849     6   6     6
# 2 5954     1   1     4
# 3 5959     6   6     7
# 4 5962     9   9     6
# 5 7901     6   6     2
# 6 7903     8   8     1

Note: I've randomly generated the scores in this example, so they will not match the desired output in the question.
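Since the question states that tpl increases by exactly 1 per row, it may also be possible to skip the merge and look scores up by row number directly. This is a sketch under that assumption, on made-up toy data; it returns the list-of-matrices shape asked for in the question:

```r
# Toy stand-ins for the real objects (values are made up for illustration)
loc_pre <- data.frame(start = c(10, 20), end = c(18, 28))
loc_locate <- list(
  cbind(start = 6, end = 6),
  cbind(start = c(1, 6, 9), end = c(1, 6, 9))
)
csv_t <- data.frame(tpl = 5:40, score = seq(100, by = 1, length.out = 36))

# Assumption: tpl is consecutive, so a position maps to a row number
# by subtracting (first tpl value - 1)
offset <- csv_t$tpl[1] - 1

score_loc <- Map(function(pre_start, loc) {
  tpl <- pre_start + loc[, "start"] - 1            # absolute positions
  cbind(loc, score = csv_t$score[tpl - offset])    # direct row lookup, no %in% scan
}, loc_pre$start, loc_locate)

score_loc[[2]]
```

Positions that fall outside the range of csv_t would come back as NA with this approach, so it is worth checking the result for missing scores.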

Upvotes: 1
