Reputation: 65
Good evening guys,I want to combine the same id row into one row,with additional column,and here is my part of data.
sample=structure(list(crsp_fundno = c(18021, 18021, 18021, 18021, 22436,
22436, 22436, 22436, 22436, 22436, 49805, 49805, 49805, 55603,
55603, 93362), seq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L,
1L, 2L, 3L, 1L, 2L, 1L), begdt = structure(c(13513, 14298, 15027,
16149, 12417, 13969, 14910, 14918, 15042, 15644, 14782, 14910,
15544, 15505, 15531, 17571), class = "Date"), enddt = structure(c(14297,
15026, 16148, 17621, 13968, 14909, 14917, 15041, 15643, 17621,
14909, 15543, 17621, 15530, 17621, 17621), class = "Date"), crsp_obj_cd = c("EDYG",
"EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG", "EDYG",
"EDYG", "EF", "EF", "EF", "EDYB", "EDYB", "M"), lipper_class = c("MLGE",
"MCCE", "MCVE", "MLCE", "MLVE", "MLVE", "MLCE", "MLVE", "MLCE",
"MLVE", "IMLC", "IMLG", "IMLC", "MTAM", "MTAC", "MATJ"), lipper_obj_cd = c("G",
"G", "G", "G", "G", "G", "G", "G", "G", "G", "IF", "IF", "IF",
"GI", "GI", "I"), lipper_asset_cd = c("EQ", "EQ", "EQ", "EQ",
"EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ", "EQ",
"EQ")), class = "data.frame", row.names = c(NA, -16L))
I tried to merge row which has the same ID in one row, and here is my code.
temp=list()
dn=unique(sample$crsp_fundno)
for(i in 1:length(dn) ){
part=sample[which(sample$crsp_fundno %in% dn[i]),]
part=reshape(part,idvar='crsp_fundno',timevar='seq',direction='wide')
temp[[i]]=part
}
library(plyr)
sum=rbind.fill(temp[[1]],temp[[2]])
for (i in 3 :length(dn)){sum=rbind.fill(sum,temp[[i]])}
Code works ,but too low in my whole data(94000 obs almost take 2 hours).
I think I should not heavily rely on for loop in large data set.
May anyone know how can I improve the code or my logic ?
Thanks for your help.
Upvotes: 0
Views: 48
Reputation: 260
I strongly recommend data.table
when it comes to speed and memory efficiency.
setDT(sample) # in place, no assignment needed
sum3 <- dcast(sample,
crsp_fundno ~ seq,
value.var = names(sample)[3:8])
Here is a comparison of the OP
's for loop
, the reshape
way suggested in one of the answers and the data.table
way suggested in this post:
Unit: milliseconds
expr min lq mean median uq max neval cld
for loop 23.735154 24.190626 25.948536 24.722330 26.176343 42.764253 100 c
reshape 6.448800 6.742147 7.196820 6.850390 7.379401 9.932432 100 b
data.table 1.928812 2.143367 2.362979 2.255964 2.447935 5.847116 100 a
Upvotes: 1
Reputation: 1203
So, using reshape
is correct, however, the implementation is not ideal. The function is already optimised to convert between long
and wide
formats without the need for any for
loop.
You only need to call it once and save time:
library(reshape2)
sum <- reshape(sample,direction = "wide",idvar = "crsp_fundno",timevar = "seq")
As you correctly had a hunch, reshape
is able to smoothly change between formats.
In your case you have:
crsp_fundno
is a variable in long
format that identifies multiple records from the same groupseq
is the variable in long
format that differentiates multiple records from the same groupUpvotes: 1