Reputation: 8626
I have a data.table myDT, and I'm making "copies" of this table in 3 different ways:
myDT <- data.table(colA = 1:3)
myDT[colA == 3]
copy1 <- copy(myDT)
copy2 <- myDT # yes I know that it's a reference, not real copy
copy3 <- myDT[,.(colA)] # I list all columns from the original table
Then I'm comparing those copies with the original table:
identical(myDT, copy1)
# TRUE
identical(myDT, copy2)
# TRUE
identical(myDT, copy3)
# FALSE
I was trying to figure out the difference between myDT and copy3:
identical(names(myDT), names(copy3))
# TRUE
all.equal(myDT, copy3, check.attributes=FALSE)
# TRUE
all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
# TRUE
attr.all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
# NULL
all.equal(myDT, copy3)
# [1] "Attributes: < Length mismatch: comparison on first 1 components >"
attr.all.equal(myDT, copy3)
# [1] "Attributes: < Names: 1 string mismatch >"
# [2] "Attributes: < Length mismatch: comparison on first 3 components >"
# [3] "Attributes: < Component 3: Attributes: < Modes: list, NULL > >"
# [4] "Attributes: < Component 3: Attributes: < names for target but not for current > >"
# [5] "Attributes: < Component 3: Attributes: < current is not list-like > >"
# [6] "Attributes: < Component 3: Numeric: lengths (0, 3) differ >"
My original question was how to understand the last output. Eventually I resorted to the attributes() function:
attr0 <- attributes(myDT)
attr3 <- attributes(copy3)
str(attr0)
str(attr3)
This showed that the original data.table had an index attribute, which was not copied when I created copy3.
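A more compact way to spot the difference is to compare the attribute names directly instead of reading two str() dumps. This is only a sketch: whether the filter step actually creates an index depends on the data.table version and on the datatable.auto.index option, and indices() is the accessor available in more recent data.table releases (it is not used in the original post).

```r
library(data.table)

myDT <- data.table(colA = 1:3)
myDT[colA == 3]              # filtering like this may auto-create an index
copy3 <- myDT[, .(colA)]     # same columns, but a new object

# Attribute names present on the original but missing from the copy
setdiff(names(attributes(myDT)), names(attributes(copy3)))

# data.table's own accessor for index names (NULL when there is none)
indices(myDT)
indices(copy3)
```

If the setdiff() call returns "index", that attribute alone accounts for identical() reporting FALSE.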
Upvotes: 3
Views: 481
Reputation: 92292
To make this question a bit clearer (and maybe useful for future readers): what really happened here is that either you set a secondary key by explicitly calling set2key (probably not), or data.table automatically set a secondary key while you were performing an ordinary operation such as filtering. The latter is a (not so) new feature added in v1.9.4:
DT[column==value] and DT[column %in% values] are now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==value] is much faster. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).
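As a rough illustration of the option mentioned in that note, the sketch below switches auto-indexing off before filtering and then inspects the table. It uses indices(), which, to my knowledge, replaced key2() in later data.table releases (just as setindex() replaced set2key()); the table and column names are made up for the example.

```r
library(data.table)

# Temporarily disable automatic index creation (option name taken from the NEWS item)
old <- options(datatable.auto.index = FALSE)

DT <- data.table(colA = 1:3)
DT[colA == 3]     # plain filter; no secondary key should be added now
indices(DT)       # inspect the index names; NULL means there is no index

options(old)      # restore the previous setting
```

With the option left at its default, the same filter would normally leave an index behind on DT.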
Let's reproduce this:
myDT <- data.table(A = 1:3)
options(datatable.verbose = TRUE)
myDT[A == 3]
# Creating new index 'A' <~~~~ Here it is
# forder took 0 sec
# Coercing double column i.'V1' to integer to match type of x.'A'. Please avoid coercion for efficiency.
# Starting bmerge ...done in 0 secs
# A
# 1: 3
attr(myDT, "index") # or using `key2(myDT)`
# integer(0)
# attr(,"__A")
# integer(0)
So, contrary to what you assumed, subsetting with myDT[,.(colA)] actually did create a copy, and thus the secondary key wasn't transferred with it. Compare:
copy1 <- myDT
attr(copy1, "index")
# integer(0)
# attr(,"__A")
# integer(0)
copy2 <- myDT[,.(A)]
# Detected that j uses these columns: A <~~~ This is where the copy occurs
attr(copy2, "index")
# NULL
identical(myDT, copy1)
# [1] TRUE
identical(myDT, copy2)
# [1] FALSE
And for some further validation
tracemem(myDT)
# [1] "<00000000159CBBB0>"
tracemem(copy1)
# [1] "<00000000159CBBB0>"
tracemem(copy2)
# [1] "<000000001A5A46D8>"
The most interesting conclusion here, one could claim, is that [.data.table does create a copy, even if the object remains unchanged.
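The same copy-versus-reference distinction can be checked without tracemem, for instance with data.table's address() helper and a by-reference update. This is a minimal sketch with illustrative names (ref, sub, and column B are not part of the original example).

```r
library(data.table)

myDT <- data.table(A = 1:3)
ref  <- myDT            # plain assignment: another name for the same object
sub  <- myDT[, .(A)]    # goes through [.data.table: a new object

address(myDT) == address(ref)   # compare memory addresses of the two names
address(myDT) == address(sub)   # compare original with the subset result

# A by-reference update made through one name shows up under the other name,
# but not in the object returned by [.data.table
ref[, B := 10L]
names(myDT)
names(sub)
```

If a truly independent table is needed, copy(myDT) makes that intent explicit.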
Upvotes: 4