frank

Reputation: 3608

alternative to subsetting in R

I have a df, YearHT, with 6.5M rows and 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.

I have heard that for loops should be avoided, and I wonder if there is a way to avoid the for loop that I have used, as running this query takes ~3 hrs.

Here is my code:

srt=NULL
for(i in doubletCounts$Var1){
    s=subset(YearHT,YearHT$berthlet==i)
    e=unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
    srt=rbind(srt,e)
}
srt=data.frame(srt)
s2=data.frame(srt$X2,srt$X1,srt$X3)
colnames(s2)=colnames(srt)
s=rbind(srt,s2)

doubletCounts is a 700 x 3 df, and each of its values is found within the large df.

I would be glad to hear any ideas to optimize/speed up this process.

Upvotes: 0

Views: 302

Answers (2)

rafa.pereira

Reputation: 13817

Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.

# load library
  library(data.table)

# convert your dataset into data.table
  setDT(YearHT)

# subset YearHT keeping values that are present in doubletCounts$Var1
  YearHT_df <- YearHT[ berthlet %in% doubletCounts$Var1]

# aggregate values
  output <- YearHT_df[, .(median = median(berthtime))]
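Note that the aggregate above computes a single overall median. If the goal is one median per `berthlet` value, as in the question's loop, adding `by` gives the grouped result. A minimal sketch on toy data (the column names are taken from the question; the values are made up):

```r
library(data.table)

# toy stand-ins for the question's objects
YearHT <- data.table(
  berthlet  = c("a|x", "a|x", "b|y", "b|y", "b|y"),
  berthtime = c(1, 3, 2, 4, 6)
)
doubletCounts <- data.frame(Var1 = c("a|x", "b|y"))

# median berthtime per berthlet, restricted to the values of interest
output <- YearHT[berthlet %in% doubletCounts$Var1,
                 .(median_berthtime = median(berthtime)),
                 by = berthlet]
```

Because the `%in%` filter and the grouped median are each a single vectorized pass over the table, this avoids the 700 separate subset scans of the original loop.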

Upvotes: 2

Benjamin

Reputation: 17279

for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.

srt = NULL
for (i in index)
{
  [stuff]
  srt = rbind(srt, [stuff])
}

is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do

srt <- vector("list", length = length(doubletCounts$Var1))
for(j in seq_along(doubletCounts$Var1)){
    i = doubletCounts$Var1[j]
    s = subset(YearHT, YearHT$berthlet == i)
    srt[[j]] = unlist(c(strsplit(i,'\\|'), median(s$berthtime)))
}
srt = data.frame(do.call(rbind, srt))

Or the apply alternative of

srt = lapply(doubletCounts$Var1,
       function(i)
       {
          s=subset(YearHT,YearHT$berthlet==i)
          unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
       }
)

Both of those should run at about the same speed.

(Note: both are untested, for lack of data, so they might be a little buggy)
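The difference preallocation makes can be sketched with toy data (names and sizes here are made up for illustration):

```r
n <- 2000

# growing with rbind: R reallocates and copies the accumulated matrix
# on every iteration, so the total work is quadratic in n
grow <- function() {
  out <- NULL
  for (i in seq_len(n)) out <- rbind(out, c(i, i^2))
  out
}

# preallocating a list: each slot is written exactly once, and the
# rows are combined in a single rbind at the end
prealloc <- function() {
  out <- vector("list", n)
  for (i in seq_len(n)) out[[i]] <- c(i, i^2)
  do.call(rbind, out)
}

identical(grow(), prealloc())  # same result, very different running time
```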

Something else you can try that might have a smaller effect would be dropping the subset call and using indexing. The content of your for loop could be boiled down to

unlist(c(strsplit(i, '\\|'),
         median(YearHT[YearHT$berthlet == i, "berthtime"])))

But I'm not sure how much time that would save.
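For what it's worth, the per-group median can also be computed without any explicit loop using base R's aggregate. A sketch on toy data, with column names taken from the question (the values are invented):

```r
# toy stand-ins for the question's objects
YearHT <- data.frame(
  berthlet  = c("a|x", "a|x", "b|y", "b|y", "b|y"),
  berthtime = c(1, 3, 2, 4, 6),
  stringsAsFactors = FALSE
)
doubletCounts <- data.frame(Var1 = c("a|x", "b|y"), stringsAsFactors = FALSE)

# keep only the berthlet values of interest, then take the median per group
sub  <- YearHT[YearHT$berthlet %in% doubletCounts$Var1, ]
meds <- aggregate(berthtime ~ berthlet, data = sub, FUN = median)

# meds now has one row per berthlet with its median berthtime;
# strsplit(meds$berthlet, '\\|') recovers the two components,
# as in the original loop
```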

Upvotes: 0
