Overriding data.table key order causes incorrect merge results

Question

In the following example I use a dplyr::arrange on a data.table with a key. This overrides the sort on that column:

x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setkey(x, "a")

# lose order on datatable key
x <- dplyr::arrange(x, b)

y <- data.table(a = sample(1000:1100), f = c(letters, NA), g = c("AA", "BB", NA, NA, NA, NA))
setkey(y, "a")

res <- merge(x, y, by = c("a"), all.x = TRUE)
# try merge with key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)

# merge results are inconsistent
identical(res, res2)

I can see that if I ordered with x <- x[order(b)], I would maintain the sort on the key and the results would be consistent.

I am not sure why I cannot use dplyr::arrange and what relationship the sort key has with the merge. Any insight would be appreciated.

Ben373 · Accepted Answer

The problem is that with dplyr::arrange(x, b) you do not remove the sorted attribute from your data.table contrary to using x <- x[order(b)] or setorder(x, "b").

The data.table way would be to use setorder in the first place e.g.

library(data.table)
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setorder(x, "b", "a", na.last=TRUE)

The wrong results of joins on data.tables which have a key although they are not sorted by it, is a known bug (see also #5361 in data.table bug tracker).

Overriding data.table key order causes incorrect merge results

Answers (1)

Related Questions