Compare matching entries in large data frames

Question

Suppose I have two data frames with the following general structure:

A <- data.frame(ID = c(1, 1, 2, 3, 6, 10), Obs = c(0, 5, 6, 7, 3, -4))
B <- data.frame(ID = c(1, 3, 2, 4, 8), Obs = c(10, -5, NA, 7, NA))

For matching IDs, I want to report:

Entries in A whose "Obs" value is NA in B, or
Entries for which the sign of "Obs" flips between A and B.

There are, however, a couple of complications:

Some IDs are not unique, and the IDs are not ordered.
Not all IDs exist in both data frames, and the data frames are not the same length.
If an ID is not unique and the Obs in one of its rows is 0, the comparison should be run against the row with the non-zero Obs.
Some entries are NA.
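
To make the rule about non-unique IDs concrete, here is a minimal sketch of one possible reading of it, in base R. The helper name pick_rows and the result name A.unique are made up for illustration. The sketch assumes that at most one row per ID should survive and that a row with a non-zero Obs is preferred over a row with Obs equal to 0; if an ID has several non-zero rows (a case the rules above do not cover), it simply keeps the first one.

```r
# Example data from the question
A <- data.frame(ID = c(1, 1, 2, 3, 6, 10), Obs = c(0, 5, 6, 7, 3, -4))

# pick_rows() is a hypothetical helper, not part of any package.
# It keeps one row per ID, preferring a row whose Obs is not 0.
pick_rows <- function(df) {
  # TRUE only where Obs is a real zero; NA counts as "not zero" here
  is.zero <- !is.na(df$Obs) & df$Obs == 0
  # Sort by ID, and within each ID put the zero rows last
  # (FALSE sorts before TRUE)
  df <- df[order(df$ID, is.zero), ]
  # duplicated() flags every row after the first one for each ID
  df[!duplicated(df$ID), ]
}

A.unique <- pick_rows(A)
```

For the example data this drops the row with ID 1 and Obs 0 and keeps the row with ID 1 and Obs 5. Because it relies only on order() and duplicated(), it needs no explicit loop over IDs.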

So far, in R, I've parsed the data frames with a loop and if statements. Part of my code looks something like this:

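results.signflip <- data.frame()
results.missingvalue <- data.frame()
Intersection.ID <- intersect(A$ID, B$ID)

for (idx.row in seq_along(Intersection.ID)) {
  # Exact match on ID; a pattern like "^1" would also match ID 10
  idx.selection.A <- which(A$ID == Intersection.ID[idx.row])
  idx.selection.B <- which(B$ID == Intersection.ID[idx.row])

  # Simplification: only the first matching row per ID is compared;
  # the zero-Obs rule is not handled here
  obs.A <- A[idx.selection.A[1], "Obs"]
  obs.B <- B[idx.selection.B[1], "Obs"]

  # Sign flip: both values present and their signs differ
  if (!is.na(obs.A) && !is.na(obs.B) && sign(obs.A) != sign(obs.B))
    results.signflip <- rbind(results.signflip, A[idx.selection.A[1], ])

  # (... more if statements ...)

}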

This is obviously a simple and not very efficient way to tackle the problem. The trouble is that the file has some 70,000 entries, and the script runs for hours.

So, my question is: does anyone have a smart idea for some really efficient code?
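
For illustration only, here is a rough sketch of what a loop-free version of the two checks might look like in base R, using merge() to pair up matching IDs. It rests on several assumptions that the question does not confirm: that "NA in B" means Obs is present in A but NA in B, that a sign flip means strictly opposite signs (so a 0 never counts as a flip), and that duplicate IDs can simply be left in place so that each A row is compared with each matching B row. The names AB, missing.in.B, sign.prod and sign.flip are made up for the example.

```r
# Example data from the question
A <- data.frame(ID = c(1, 1, 2, 3, 6, 10), Obs = c(0, 5, 6, 7, 3, -4))
B <- data.frame(ID = c(1, 3, 2, 4, 8), Obs = c(10, -5, NA, 7, NA))

# Inner join on ID: only IDs present in both frames survive.
# The Obs columns become Obs.A and Obs.B.
AB <- merge(A, B, by = "ID", suffixes = c(".A", ".B"))

# Check 1 (assumed reading): value present in A but NA in B
missing.in.B <- !is.na(AB$Obs.A) & is.na(AB$Obs.B)

# Check 2 (assumed reading): strictly opposite signs, i.e. the
# product of the signs is -1; NA products count as "no flip"
sign.prod <- sign(AB$Obs.A) * sign(AB$Obs.B)
sign.flip <- !is.na(sign.prod) & sign.prod == -1

results.missingvalue <- AB[missing.in.B, ]
results.signflip <- AB[sign.flip, ]
```

On the example data, results.missingvalue contains only ID 2 and results.signflip contains only ID 3. The join and both checks are vectorized, so nothing is grown row by row with rbind() inside a loop.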
