user3464299
Reputation: 25

Add column to datasets using loop

I wrote the following code to extract multiple datasets out of one large dataset based on the column Time.

for (i in 1:nrow(position)) {
  assign(paste("position.", i, sep = ""),
         subset(dataset, Time >= position[i, 1] & Time <= position[i, 2]))
}

(position holds the start times in its first column, position[,1], and the stop times in its second column, position[,2].)

The outputs are subsets of my original dataset and look like:

position.1
position.2
position.3
....

Is it possible to add an extra column to each of the new datasets (position.1, position.2, ...) that identifies it by number?

eg: position.1 has an extra column with value 1, position.2 has an extra column with value 2, and so on.

I need those numbers to identify the datasets (position.1, position.2, ...) after I rbind them into one dataset again in a final step.

Upvotes: 1

Views: 280

Answers (2)

Michael Lawrence
Reputation: 1021

In addition to Thomas's recommendation to avoid side effects, you might want to take advantage of existing packages that detect overlaps. The IRanges package in Bioconductor can detect overlaps between one set of ranges (position) and another set of ranges or positions (dataset$Time). This gets you the matches between the time points and the ranges:

r <- IRanges(position[[1L]], position[[2L]])
hits <- findOverlaps(dataset$Time, r)

Now, you want to extract a subset of the dataset that overlaps each range in position. We can group the query (Time) indices by the subject (position) indices and extract a list from the dataset using that grouping:

dataset <- DataFrame(dataset)
l <- extractList(dataset, split(queryHits(hits), subjectHits(hits)))

To get the final answer, we need to combine the list elements row-wise, while adding a column that denotes their group membership:

ans <- stack(l)

Upvotes: 1

Thomas
Reputation: 44565

Since you don't provide example data, this is untested, but should work for you:

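dflist <- lapply(seq_len(nrow(position)), function(x) {
    # rows whose Time falls within the x-th start/stop interval
    sub <- dataset[dataset$Time >= position[x,1] & dataset$Time <= position[x,2],]
    within(sub, val <- x)  # tag the subset with its index
})
do.call(rbind, dflist)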

In general, avoid the strategy you propose of assigning multiple numbered objects to the global environment. It is much easier to store all of the subsets in a list and then bind them back together using do.call(rbind, dflist). This is more efficient, produces less clutter in your workspace, and is a more "functional" style of programming.
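To make the list-then-rbind idea concrete, here is a small sketch on made-up data. The toy dataset, the value column, and the start/stop column names are assumptions for illustration only, since the question gives no example data. The identifier column is filled with rep() so that an interval matching no rows does not raise an error.

```r
# Toy data (hypothetical): a dataset with a Time column and a two-column
# table of start/stop times, standing in for the asker's objects.
dataset  <- data.frame(Time = 1:10, value = seq(10, 100, by = 10))
position <- data.frame(start = c(1, 6), stop = c(3, 8))

# Build one subset per row of `position`, tagging each with its row number.
dflist <- lapply(seq_len(nrow(position)), function(i) {
    keep <- dataset$Time >= position[i, 1] & dataset$Time <= position[i, 2]
    sub <- dataset[keep, ]
    sub$val <- rep(i, nrow(sub))  # rep() keeps this safe when no rows match
    sub
})

# Combine the subsets into a single data frame.
combined <- do.call(rbind, dflist)
```

Here combined should contain the rows with Time 1 to 3 tagged val = 1 and the rows with Time 6 to 8 tagged val = 2, and no position.1, position.2, ... objects are created in the workspace.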

Upvotes: 1
