I have two dataframes of ~150 rows, X and Y, where identical(X, Y) is TRUE but identical(digest(X), digest(Y)) is FALSE. I'm looking into why this is the case.
I did look at this answer and re-ran what they tested, with similar results, but unlike their problem, the attributes for my dataframes are the same. Testing results:
> names(attributes(X))
[1] "names" "row.names" "class"
> names(attributes(Y))
[1] "names" "row.names" "class"
> digest(X)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> digest(X[1:nrow(X),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> digest(Y[1:nrow(Y),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> identical(X, Y, attrib.as.set=FALSE)
[1] TRUE
I also saved the dataframes as .RDS files, and re-read them in.
> X_rds <- read_rds("cache_vars/X.rds")
> Y_rds <- read_rds("cache_vars/Y.rds")
> identical(X_rds, Y_rds)
[1] TRUE
> digest(X_rds)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y_rds)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> identical(X_rds, Y_rds, attrib.as.set=FALSE)
[1] TRUE
And like the other poster, converting to matrices and back to dataframe yielded identical digests, so it's probably some structural problem.
> X_Mat <- as.matrix(X_rds)
> Y_Mat <- as.matrix(Y_rds)
> identical(digest(X_Mat), digest(Y_Mat))
[1] TRUE
> X_DF <- as.data.frame(X_Mat)
> Y_DF <- as.data.frame(Y_Mat)
> identical(digest(X_DF), digest(Y_DF))
[1] TRUE
Dataframe X was produced from a loop written for parallel execution (but run with the %do% operator, so no actual parallelism took place) and Y was produced from a sequential loop.
The .RDS files for X and Y can be found at this link.
Update:
MrFlick has it right. As it turns out, the serialization during parallel's rbind function was also adding the gp=0x20 flag, similar to what they described occurs when writing to RDS.
Reputation: 206197
When you write to rds, the objects are serialized. The serialization contains some information in addition to the values the vectors contain. Note that if we compare the columns one by one, they produce different digests:
# Compare the digest of each column of X_rds with the matching column of Y_rds
sapply(seq_along(X_rds), function(i)
  digest::digest(X_rds[[i]]) == digest::digest(Y_rds[[i]])
)
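To see which columns differ by name without going through digest, the same per-column comparison can be sketched with base R's serialize. This is only an illustration: cols_serialize_equal and the two toy data frames are hypothetical and are not taken from the question's data.

```r
# Hypothetical helper: for each column name, report whether the two data
# frames' columns serialize to exactly the same bytes (base R only).
cols_serialize_equal <- function(a, b) {
  stopifnot(identical(names(a), names(b)))
  vapply(names(a), function(nm) {
    identical(serialize(a[[nm]], NULL), serialize(b[[nm]], NULL))
  }, logical(1))
}

# Toy data (made up for illustration): the "label" column differs in one value
df1 <- data.frame(id = c(1, 2, 3), label = c("a", "b", "c"))
df2 <- data.frame(id = c(1, 2, 3), label = c("a", "b", "x"))
res <- cols_serialize_equal(df1, df2)
print(res)
```

With the question's data one would call cols_serialize_equal(X_rds, Y_rds); columns reported as FALSE are the ones whose serialized form, and therefore default digest, differs.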
So the vectors that are being stored in the data.frame are different. We can use the internal inspect function to get some of the metadata for the vectors:
.Internal(inspect(X_rds[[1]]))
# @135305a00 14 REALSXP g0c7 [REF(4),gp=0x20] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
.Internal(inspect(Y_rds[[1]]))
# @115dbfc00 14 REALSXP g0c7 [REF(29)] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
So we see they differ in the [] parts. I believe the REF() number represents the reference count to that object for memory clearing purposes. I do not believe that this number is used in the serialization. But X_rds also has gp=0x20 set. The "gp" stands for "general purpose" bits/flags. I believe in this case it means the GROWABLE_MASK was set on that object. These flag values are preserved when the object is serialized, and digest serializes its input by default. Thus these vectors do not have the exact same serialization due to this flag difference.
Another way to see the difference is to look at the serialization itself:
substring(rawToChar(serialize(X_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n131086\n150\n1009002\n"
substring(rawToChar(serialize(Y_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n14\n150\n1009002\n1009"
We have a bit of a header, then we start to see the values being output. One value differs: X has 131086 (0x2000E) where Y has 14 (0xE). That difference is due to the flags, which the R serialization source code writes into this field.
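The two header values can be unpacked by hand to confirm that they differ only in the gp bits. The sketch below assumes the layout used by R's serialization code, where the low 8 bits hold the SEXP type and the gp ("levels") field starts at bit 12; decode_flags is a hypothetical helper, not an R API.

```r
# Hypothetical helper: split a serialized flags word into type and gp bits.
# Assumed layout: bits 0-7 = SEXP type, bits 12 and up = gp ("levels") field.
decode_flags <- function(flags) {
  flags <- as.integer(flags)
  list(
    type = bitwAnd(flags, 0xFFL),
    gp   = bitwAnd(as.integer(bitwShiftR(flags, 12L)), 0xFFFFL)
  )
}

x_flags <- decode_flags(131086L)  # value seen in the serialization of X
y_flags <- decode_flags(14L)      # value seen in the serialization of Y

sprintf("type=%d gp=0x%X", x_flags$type, x_flags$gp)  # type=14 gp=0x20
sprintf("type=%d gp=0x%X", y_flags$type, y_flags$gp)  # type=14 gp=0x0
```

Both words decode to type 14 (REALSXP, matching the inspect output), and only X carries gp=0x20.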
When you use identical, only the values in the data.frame are compared, not the additional metadata.
If you wanted to get around this, you could write your own wrapper around digest that avoids the serialization. For example:
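dfdigest <- function(x) {
  # Convert a character vector to one flat raw vector
  charsToRaw <- function(x) unlist(lapply(x, charToRaw))
  # Concatenate the raw bytes of the column names and of every column
  bytes <- unlist(c(
    list(charsToRaw(names(x))),
    lapply(x, function(col) {
      if (typeof(col) == "double") writeBin(col, raw())
      else if (typeof(col) == "character") charsToRaw(col)
      else stop(paste("unconfigured data type:", typeof(col)))
    })
  ))
  # Hash the bytes directly, skipping R's serialization
  digest::digest(bytes, serialize = FALSE)
}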
dfdigest(X_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
dfdigest(Y_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
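A lighter-weight alternative may be to rebuild the data frame before hashing. This is suggested by the question's own test, where digest(X[1:nrow(X),]) and digest(Y[1:nrow(Y),]) matched, presumably because subsetting allocates fresh column vectors without the extra flag. Treat it as a sketch under that assumption rather than a guarantee; rebuild_df and the toy data frame are hypothetical.

```r
# Hypothetical helper: re-create every column by subsetting all rows, so the
# result holds freshly allocated vectors (assumption: fresh vectors do not
# carry the gp flag that changed the serialization).
rebuild_df <- function(df) {
  df[seq_len(nrow(df)), , drop = FALSE]
}

# Toy data, made up for illustration
df <- data.frame(id = c(1, 2, 3), label = c("a", "b", "c"))
rebuilt <- rebuild_df(df)

# The values are unchanged; only the underlying vectors are new
all.equal(df, rebuilt)

# Hash the rebuilt copy if the digest package is available
if (requireNamespace("digest", quietly = TRUE)) {
  print(digest::digest(rebuilt))
}
```

Unlike dfdigest, this keeps the default serialization, so it handles any column type, but it relies on behaviour observed in the question rather than on a documented guarantee.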