frank

Reputation: 3608

alternative to subsetting in R

I have a df, YearHT, with 6.5M rows and 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.

I have heard that for loops should be avoided, and I wonder if there is a way to avoid the for loop that I have used, as running this query takes ~3 hrs.

Here is my code:

srt=NULL
for(i in doubletCounts$Var1){
    s=subset(YearHT,YearHT$berthlet==i)
    e=unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
    srt=rbind(srt,e)
}
srt=data.frame(srt)
s2=data.frame(srt$X2,srt$X1,srt$X3)
colnames(s2)=colnames(srt)
s=rbind(srt,s2)

doubletCounts is a 700 x 3 df, and each of its values is found within the large df.

I would be glad to hear any ideas to optimize/speed up this process.

Upvotes: 0

Views: 302

Answers (2)

rafa.pereira

Reputation: 13817

Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.

# load library
  library(data.table)

# convert your dataset into data.table
  setDT(YearHT)

# subset YearHT keeping values that are present in doubletCounts$Var1
  YearHT_df <- YearHT[ berthlet %in% doubletCounts$Var1]

# aggregate values
  output <- YearHT_df[, .(median = median(berthtime))]
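Note that the aggregate above computes a single overall median. If the goal is one median per `berthlet` value, as in the question's loop, adding `by` gives the grouped result. A minimal sketch on toy data (the column names are taken from the question; the values are made up):

```r
library(data.table)

# toy stand-ins for the question's objects
YearHT <- data.table(
  berthlet  = c("a|x", "a|x", "b|y", "b|y", "b|y"),
  berthtime = c(1, 3, 2, 4, 6)
)
doubletCounts <- data.frame(Var1 = c("a|x", "b|y"))

# median berthtime per berthlet, restricted to the values of interest
output <- YearHT[berthlet %in% doubletCounts$Var1,
                 .(median_berthtime = median(berthtime)),
                 by = berthlet]
```

Because the `%in%` filter and the grouped median are each a single vectorized pass over the table, this avoids the 700 separate subset scans of the original loop.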

Upvotes: 2

Benjamin

Reputation: 17279

for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.

srt = NULL
for (i in index)
{
  [stuff]
  srt = rbind(srt, [stuff])
}

is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do

srt <- vector("list", length = length(doubletCounts$Var1))
for(j in seq_along(doubletCounts$Var1)){
    i = doubletCounts$Var1[j]
    s = subset(YearHT, YearHT$berthlet == i)
    srt[[j]] = unlist(c(strsplit(i,'\\|'), median(s$berthtime)))
}
srt = data.frame(do.call(rbind, srt))

Or the apply alternative of

srt = lapply(doubletCounts$Var1,
       function(i)
       {
          s=subset(YearHT,YearHT$berthlet==i)
          unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
       }
)

Both of those should run at about the same speed.

(Note: both are untested, for lack of data, so they might be a little buggy)
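The difference preallocation makes can be sketched with toy data (names and sizes here are made up for illustration):

```r
n <- 2000

# growing with rbind: R reallocates and copies the accumulated matrix
# on every iteration, so the total work is quadratic in n
grow <- function() {
  out <- NULL
  for (i in seq_len(n)) out <- rbind(out, c(i, i^2))
  out
}

# preallocating a list: each slot is written exactly once, and the
# rows are combined in a single rbind at the end
prealloc <- function() {
  out <- vector("list", n)
  for (i in seq_len(n)) out[[i]] <- c(i, i^2)
  do.call(rbind, out)
}

identical(grow(), prealloc())  # same result, very different running time
```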

Something else you can try that might have a smaller effect would be dropping the subset call and using indexing. The content of your for loop could be boiled down to

unlist(c(strsplit(i, '\\|'),
         median(YearHT[YearHT$berthlet == i, "berthtime"])))

But I'm not sure how much time that would save.
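For what it's worth, the per-group median can also be computed without any explicit loop using base R's aggregate. A sketch on toy data, with column names taken from the question (the values are invented):

```r
# toy stand-ins for the question's objects
YearHT <- data.frame(
  berthlet  = c("a|x", "a|x", "b|y", "b|y", "b|y"),
  berthtime = c(1, 3, 2, 4, 6),
  stringsAsFactors = FALSE
)
doubletCounts <- data.frame(Var1 = c("a|x", "b|y"), stringsAsFactors = FALSE)

# keep only the berthlet values of interest, then take the median per group
sub  <- YearHT[YearHT$berthlet %in% doubletCounts$Var1, ]
meds <- aggregate(berthtime ~ berthlet, data = sub, FUN = median)

# meds now has one row per berthlet with its median berthtime;
# strsplit(meds$berthlet, '\\|') recovers the two components,
# as in the original loop
```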

Upvotes: 0
