Reputation: 113
In the following example I use a dplyr::arrange
on a data.table
with a key. This overrides the sort on that column:
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setkey(x, "a")
# lose order on datatable key
x <- dplyr::arrange(x, b)
y <- data.table(a = sample(1000:1100), f = c(letters, NA), g = c("AA", "BB", NA, NA, NA, NA))
setkey(y, "a")
res <- merge(x, y, by = c("a"), all.x = TRUE)
# try merge with key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)
# merge results are inconsistent
identical(res, res2)
I can see that if I ordered with x <- x[order(b)]
, I would maintain the sort on the key and the results would be consistent.
I am not sure why I cannot use dplyr::arrange
and what relationship the sort key has with the merge. Any insight would be appreciated.
Upvotes: 1
Views: 114
Reputation: 971
The problem is that with dplyr::arrange(x, b)
you do not remove the sorted
attribute from your data.table
contrary to using x <- x[order(b)]
or setorder(x, "b")
.
The data.table
way would be to use setorder
in the first place e.g.
library(data.table)
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setorder(x, "b", "a", na.last=TRUE)
The wrong results of joins on data.table
s which have a key
although they are not sorted by it, is a known bug (see also #5361 in data.table
bug tracker).
Upvotes: 1