screechOwl

Reputation: 28169

R convert data frame to input file - improve performance

I'm trying to convert a data frame from R to a text file.

The data set is roughly 1500 x 700, and looping through the data frame takes a while. Is there any way to speed up the process?

My data frame is like this:

>train2
score   x1    x2    x3     x4     x5 ...  x700
  0     0      1     1      1     0        0
  1     0      1     0      0     0        0
  0     1      0     1      1     1        0
  3     0      1     1      1     0        0
  1     0      1     0      1     0        0
  2     1      1     1      1     0        1
  0     0      1     1      0     0        0
 ...    .      .     .      .     .        .

In the created file I only include cells that are non-zero.

So the output for rows 1-3 would be:

0 | x2:1 x3:1 x4:1
1 | x2:1
0 | x1:1 x3:1 x4:1 x5:1

My current code runs like this:

pt1 <- paste(train2$score, " | ", sep="")
collect1 <- c()
for(j in 1:nrow(train2)){
  word1 <- pt1[j]
  for(i in 10:ncol(train2)){
    if(train2[j,i] != 0){
      word1 <- paste(word1, colnames(train2)[i], ":", train2[j,i], " ", sep="")
    }
  }
  collect1 <- c(collect1, word1)
  if(j %% 100 == 0){
    print(j)
    flush.console()
    gc()
  }
}

Each run takes ~ 3-4 minutes. Is there anything obvious to improve the performance?

EDIT: after the loops are completed, the resulting character vector collect1 is written to a text file using:

  write(collect1, file="outPut1.txt")

Upvotes: 1

Views: 240

Answers (1)

Dan Gerlanc

Reputation: 417

Try vectorizing the operation as follows (I put 'score' in a separate variable and removed it from 'train3' so I wouldn't have to subset the data frame in the anonymous function):

score  <- train2$score
train3 <- train2[, -1]
cols   <- colnames(train3)
res <- apply(train3, 1, function(x) {
  idx  <- x != 0
  nms  <- cols[idx]
  vals <- x[idx]
  paste(nms, vals, sep=":", collapse=" ")
})

out <- paste(score, "|", as.vector(res))
print(out)
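
To check the approach on something small, here is a minimal sketch that runs the same logic on a toy data frame built from the first three rows of the question, restricted to the columns score and x1 to x4. The helper name to_sparse_lines and the temporary output file are assumptions made for illustration; they are not part of the original code.

```r
# Toy data: first three rows of the question, columns score and x1..x4 only
train2 <- data.frame(
  score = c(0, 1, 0),
  x1    = c(0, 0, 1),
  x2    = c(1, 1, 0),
  x3    = c(1, 0, 1),
  x4    = c(1, 0, 1)
)

# Hypothetical helper: assumes the first column is the score and every
# remaining column is a numeric feature
to_sparse_lines <- function(df) {
  score <- df[[1]]
  feats <- as.matrix(df[, -1, drop = FALSE])
  cols  <- colnames(feats)
  res <- apply(feats, 1, function(x) {
    idx <- x != 0                      # keep only the non-zero cells
    paste(cols[idx], x[idx], sep = ":", collapse = " ")
  })
  paste(score, "|", res)
}

out <- to_sparse_lines(train2)
print(out)

# Write one line per row; a temporary file is used here so nothing is overwritten
out_file <- tempfile(fileext = ".txt")
writeLines(out, con = out_file)
```

On this toy input the three lines are "0 | x2:1 x3:1 x4:1", "1 | x2:1" and "0 | x1:1 x3:1 x4:1". Note that a row with no non-zero features yields an empty string after the "|", so the line ends in a trailing space.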

Upvotes: 4
