lokheart

Reputation: 24675

plyr in R very slow during merging

I am using the plyr package in R to do the following:

I added a progress bar to track the job, but after it reaches 100% the job still seems to be running: RGUI keeps my CPU busy and the call never finishes.

Table A has about 40000 rows of data, in which column A and column B are unique.

I suspect that the "combine" step of plyr's "split-apply-combine" workflow cannot handle these 40000 rows, because the same code works on another table with 4000 rows.

Any suggestions for improving the efficiency? Thanks.

UPDATE

Here is my code:

for (loop.filename in seq_len(nrow(filename))) {
  print("infection source merge")
  print(filename[loop.filename, "table_name"])
  temp <- get(filename[loop.filename, "table_name"])
  temp1 <- ddply(temp,
                 c("HOSP_NO", "REF_DATE"),
                 function(df) {
                   # Case definitions in abcde for this HOSP_NO / REF_DATE group
                   temp.infection.source <- abcde[abcde[, "Case_Number"] == unique(df[, "HOSP_NO"]) &
                                                  abcde[, "Reference_Date"] == unique(df[, "REF_DATE"]),
                                                  "Case_Definition"]
                   if (length(temp.infection.source) == 0) {
                     temp.infection.source <- "NIL"
                   } else if (length(unique(temp.infection.source)) > 1) {
                     temp.infection.source <- "MULTIPLE"
                   } else {
                     temp.infection.source <- unique(temp.infection.source)
                   }
                   data.frame(df, INFECTION_SOURCE = temp.infection.source)
                 },
                 .progress = "text")
  assign(filename[loop.filename, "table_name"], temp1)
}

Upvotes: 3

Views: 500

Answers (1)

Joris Meys

Reputation: 108583

If I understood correctly what you're trying to achieve, this should do what you want, pretty quickly and without using too much memory.

#toy data
A <- data.frame(
    A=letters[1:10],
    B=letters[11:20],
    CC=1:10
)

ord <- sample(1:10)
B <- data.frame(
    A=letters[1:10][ord],
    B=letters[11:20][ord],
    CC=(1:10)[ord]
)
#combining values
A.comb <- paste(A$A,A$B,sep="-")
B.comb <- paste(B$A,B$B,sep="-")
#matching
A$DD <- B$CC[match(A.comb,B.comb)]
A

This applies only if the combinations are unique. If they're not, you'll have to take care of that first. Without the data it's quite impossible to know what you're trying to achieve exactly in your complete function, but you should be able to port the logic given here to your own case.
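For the non-unique case, one possible way to port this to the code in the question is sketched below. It is only a guess at the intent: the toy versions of `temp` and `abcde` are made up, the column names are taken from the question, and the `"|"` separator is assumed not to occur in the key values. The idea is to collapse the lookup table to one label per key first (the single distinct value, or "MULTIPLE"), then use `match` as above and fill the unmatched rows with "NIL".

```r
# Hypothetical toy data standing in for the question's tables
abcde <- data.frame(
  Case_Number     = c("H1", "H1", "H2", "H3", "H3"),
  Reference_Date  = c("2010-01-01", "2010-01-01", "2010-01-02",
                      "2010-01-03", "2010-01-03"),
  Case_Definition = c("X", "Y", "X", "Z", "Z"),
  stringsAsFactors = FALSE
)
temp <- data.frame(
  HOSP_NO  = c("H1", "H2", "H3", "H4"),
  REF_DATE = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-04"),
  stringsAsFactors = FALSE
)

# One combined key per row (assumes "|" never appears in the key columns)
lookup.key <- paste(abcde$Case_Number, abcde$Reference_Date, sep = "|")
temp.key   <- paste(temp$HOSP_NO, temp$REF_DATE, sep = "|")

# Collapse the lookup table to one label per key:
# the single distinct value, or "MULTIPLE" if a key has several
label.per.key <- vapply(
  split(as.character(abcde$Case_Definition), lookup.key),
  function(x) {
    ux <- unique(x)
    if (length(ux) > 1) "MULTIPLE" else ux
  },
  character(1)
)

# Vectorised lookup; keys with no match come back as NA and become "NIL"
temp$INFECTION_SOURCE <- unname(label.per.key[match(temp.key, names(label.per.key))])
temp$INFECTION_SOURCE[is.na(temp$INFECTION_SOURCE)] <- "NIL"
temp
```

This does the per-key work once on the lookup table instead of once per group of the large table, so no data frames have to be built and recombined for every group.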

Upvotes: 2
