Chon Kit Hui
Chon Kit Hui

Reputation: 65

Speed up or replace for loop:Group reshape and row bind

Good evening guys,I want to combine the same id row into one row,with additional column,and here is my part of data.

sample=structure(list(crsp_fundno = c(18021, 18021, 18021, 18021, 22436, 
                                      22436, 22436, 22436, 22436, 22436, 49805, 49805, 49805, 55603, 
                                      55603, 93362), seq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 
                                                             1L, 2L, 3L, 1L, 2L, 1L), begdt = structure(c(13513, 14298, 15027, 
                                                                                                          16149, 12417, 13969, 14910, 14918, 15042, 15644, 14782, 14910, 
                                                                                                          15544, 15505, 15531, 17571), class = "Date"), enddt = structure(c(14297, 
                                                                                                                                                                            15026, 16148, 17621, 13968, 14909, 14917, 15041, 15643, 17621, 
                                                                                                                                                                            14909, 15543, 17621, 15530, 17621, 17621), class = "Date"), crsp_obj_cd = c("EDYG", 
                                                                                                                                                                                                                                                        "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", 
                                                                                                                                                                                                                                                        "EDYG", "EF", "EF", "EF", "EDYB", "EDYB", "M"), lipper_class = c("MLGE", 
                                                                                                                                                                                                                                                                                                                         "MCCE", "MCVE", "MLCE", "MLVE", "MLVE", "MLCE", "MLVE", "MLCE", 
                                                                                                                                                                                                                                                                                                                         "MLVE", "IMLC", "IMLG", "IMLC", "MTAM", "MTAC", "MATJ"), lipper_obj_cd = c("G", 
                                                                                                                                                                                                                                                                                                                                                                                                    "G", "G", "G", "G", "G", "G", "G", "G", "G", "IF", "IF", "IF", 
                                                                                                                                                                                                                                                                                                                                                                                                    "GI", "GI", "I"), lipper_asset_cd = c("EQ", "EQ", "EQ", "EQ", 
                                                                                                                                                                                                                                                                                                                                                                                                                                          "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", 
                                                                                                                                                                                                                                                                                                                                                                                                                                          "EQ")), class = "data.frame", row.names = c(NA, -16L))

I tried to merge row which has the same ID in one row, and here is my code.

temp=list()
dn=unique(sample$crsp_fundno)
for(i in 1:length(dn) ){
  part=sample[which(sample$crsp_fundno %in% dn[i]),]
  part=reshape(part,idvar='crsp_fundno',timevar='seq',direction='wide')
  temp[[i]]=part
}

library(plyr)
sum=rbind.fill(temp[[1]],temp[[2]])

for (i in 3 :length(dn)){sum=rbind.fill(sum,temp[[i]])}

Code works ,but too low in my whole data(94000 obs almost take 2 hours).

I think I should not heavily rely on for loop in large data set.

May anyone know how can I improve the code or my logic ?

Thanks for your help.

Upvotes: 0

Views: 48

Answers (2)

haci
haci

Reputation: 260

I strongly recommend data.table when it comes to speed and memory efficiency.

setDT(sample) # in place, no assignment needed
sum3 <- dcast(sample,
              crsp_fundno ~ seq,
              value.var = names(sample)[3:8])

Here is a comparison of the OP's for loop, the reshape way suggested in one of the answers and the data.table way suggested in this post:

Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval cld
   for loop 23.735154 24.190626 25.948536 24.722330 26.176343 42.764253   100   c
    reshape  6.448800  6.742147  7.196820  6.850390  7.379401  9.932432   100  b 
 data.table  1.928812  2.143367  2.362979  2.255964  2.447935  5.847116   100 a  

Upvotes: 1

Vitali Avagyan
Vitali Avagyan

Reputation: 1203

So, using reshape is correct, however, the implementation is not ideal. The function is already optimised to convert between long and wide formats without the need for any for loop.

You only need to call it once and save time:

library(reshape2)
sum <- reshape(sample,direction = "wide",idvar = "crsp_fundno",timevar = "seq")

As you correctly had a hunch, reshape is able to smoothly change between formats.

In your case you have:

  1. crsp_fundno is a variable in long format that identifies multiple records from the same group
  2. seq is the variable in long format that differentiates multiple records from the same group

Upvotes: 1

Related Questions