Merge data.table by two variables and closest third

Question

I would like to perform an operation on data.tables, that I can currently, successfully do with data.frames. Essentially, it is a merge function of two data.frames, that finds the closest match in df2 for df1 for one of many matching variables. This code is below.

I would like to do this in data.tables, because my data.frames are very large, and my current setup crashes if I try to complete this operation on the full data. Data.table, might allow me to do it outright on the full set, but if not, I find data.table easier to work with when using multiple subsets of data.

I am looking for the Id (and its corresponding value) from df2 that has the closest match to a States value in df1 by the variables MM and variable (in this data.frame method, multiple pairings can occur if the there is a closest match tie (e.g. a value at both plus 1 and minus 1 exists)). When using data.frames I get the solution as final below. I don't know how to set up data.table to give me the same result. I have tried variation of my keys, one example is below. There is an answer using data.table in the data.frame question I reference in the code, however, I can not get it to work with my example data.

# data.frame method
# used info from this thread: https://stackoverflow.com/questions/16095680
df1 <- structure(list(State = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 1L, 
1L, 1L), .Label = c("AK", "CO", "MS"), class = "factor"), MM = c(1L, 
2L, 1L, 2L, 3L, 4L, 3L, 4L, 2L), variable = structure(c(1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"), 
    value = c(1L, 2L, 3L, 4L, 2L, 3L, 5L, 6L, 7L)), .Names = c("State", 
"MM", "variable", "value"), class = "data.frame", row.names = c(NA, 
-9L))
df2 <- structure(list(Id = c(1L, 2L, 3L, 1L, 2L, 3L, 5L, 6L, 7L, 5L, 
6L, 7L, 8L), MM = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 
4L, 5L), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"), 
    value = c(1, 2, 3, 2, 3, 4, 2, 3, 5.5, 6.5, 3.5, 2.5, 8)), .Names = c("Id", 
"MM", "variable", "value"), class = "data.frame", row.names = c(NA, 
-13L))

#Find rows that match by x and y
res <- merge(df1, df2, by = c("MM", "variable"), all.x = TRUE)

res$dif <- abs(res$value.x - res$value.y)

#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ MM + variable, data = res, FUN = min), res)

#Finally merge the result back into df1
final <- merge(df1, res1[res1$dif <= 1, c("MM", "variable", "State", "Id", "value.y")], all.x = TRUE)

### one Data.table attempts
# create data.tables with the same key columns
keycols1 = c("MM", "variable", "value")
df1t <- data.table(df1, key = keycols1)
df2t <- data.table(df2, key = key(df1t))
setkey(df1t, value)
setkey(df2t, value)
test.final <- df2t[df1t, roll='nearest', allow.cartesian=TRUE]

ytsaig · Accepted Answer

Not sure if this is the best way to achieve what you want, but here is one approach that is similar to what you do with data frames, only using data.tables instead:

dt1 <- data.table(df1)
dt2 <- data.table(df2)
res <- merge(dt1, dt2, by = c("MM", "variable"), all.x = TRUE, allow.cartesian=TRUE)
final_dt <- res[, .SD[abs(value.x - value.y) == min(abs(value.x - value.y))], by=c("State", "MM", "variable")]

Note that the result in final_dt differs from your result in final for (State=AK, MM=3, variable=TMX), where your approach above does not return a match even though according to your description a match should be returned.

Merge data.table by two variables and closest third

Answers (1)

Related Questions