Speed up or replace for loop:Group reshape and row bind

Question

Good evening guys,I want to combine the same id row into one row,with additional column,and here is my part of data.

sample=structure(list(crsp_fundno = c(18021, 18021, 18021, 18021, 22436, 
                                      22436, 22436, 22436, 22436, 22436, 49805, 49805, 49805, 55603, 
                                      55603, 93362), seq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 
                                                             1L, 2L, 3L, 1L, 2L, 1L), begdt = structure(c(13513, 14298, 15027, 
                                                                                                          16149, 12417, 13969, 14910, 14918, 15042, 15644, 14782, 14910, 
                                                                                                          15544, 15505, 15531, 17571), class = "Date"), enddt = structure(c(14297, 
                                                                                                                                                                            15026, 16148, 17621, 13968, 14909, 14917, 15041, 15643, 17621, 
                                                                                                                                                                            14909, 15543, 17621, 15530, 17621, 17621), class = "Date"), crsp_obj_cd = c("EDYG", 
                                                                                                                                                                                                                                                        "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", 
                                                                                                                                                                                                                                                        "EDYG", "EF", "EF", "EF", "EDYB", "EDYB", "M"), lipper_class = c("MLGE", 
                                                                                                                                                                                                                                                                                                                         "MCCE", "MCVE", "MLCE", "MLVE", "MLVE", "MLCE", "MLVE", "MLCE", 
                                                                                                                                                                                                                                                                                                                         "MLVE", "IMLC", "IMLG", "IMLC", "MTAM", "MTAC", "MATJ"), lipper_obj_cd = c("G", 
                                                                                                                                                                                                                                                                                                                                                                                                    "G", "G", "G", "G", "G", "G", "G", "G", "G", "IF", "IF", "IF", 
                                                                                                                                                                                                                                                                                                                                                                                                    "GI", "GI", "I"), lipper_asset_cd = c("EQ", "EQ", "EQ", "EQ", 
                                                                                                                                                                                                                                                                                                                                                                                                                                          "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", 
                                                                                                                                                                                                                                                                                                                                                                                                                                          "EQ")), class = "data.frame", row.names = c(NA, -16L))

I tried to merge row which has the same ID in one row, and here is my code.

temp=list()
dn=unique(sample$crsp_fundno)
for(i in 1:length(dn) ){
  part=sample[which(sample$crsp_fundno %in% dn[i]),]
  part=reshape(part,idvar='crsp_fundno',timevar='seq',direction='wide')
  temp[[i]]=part
}

library(plyr)
sum=rbind.fill(temp[[1]],temp[[2]])

for (i in 3 :length(dn)){sum=rbind.fill(sum,temp[[i]])}

Code works ,but too low in my whole data(94000 obs almost take 2 hours).

I think I should not heavily rely on for loop in large data set.

May anyone know how can I improve the code or my logic ?

Thanks for your help.

Vitali Avagyan · Accepted Answer

So, using reshape is correct, however, the implementation is not ideal. The function is already optimised to convert between long and wide formats without the need for any for loop.

You only need to call it once and save time:

library(reshape2)
sum <- reshape(sample,direction = "wide",idvar = "crsp_fundno",timevar = "seq")

As you correctly had a hunch, reshape is able to smoothly change between formats.

In your case you have:

crsp_fundno is a variable in long format that identifies multiple records from the same group
seq is the variable in long format that differentiates multiple records from the same group

Speed up or replace for loop:Group reshape and row bind

Answers (2)

Related Questions