Reputation: 700
I want to create a persistent scikit-learn model and reference it later via a hash. Using joblib for serialization, I would expect full (bit-level) integrity as long as my data does not change. But every time I run the code, the model file on disk has a different hash. Why is that, and how can I get a truly identical serialization every time I run the unchanged code? Setting a fixed seed did not help (I am not sure whether sklearn's algorithm even uses random numbers in this simple example).
import numpy as np
from sklearn import linear_model
import joblib
import hashlib

# set a fixed seed …
np.random.seed(1979)

# internal md5sum function
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# dummy regression data
X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]

# create model
reg = linear_model.LinearRegression()

# save model to disk to make it persistent
joblib.dump(reg, "reg.joblib")

# load persistent model from disk
model = joblib.load("reg.joblib")

# fit & predict
reg.fit(X, Y)
model.fit(X, Y)
myprediction1 = reg.predict([[2., 2., 0.1, 1.1]])
myprediction2 = model.predict([[2., 2., 0.1, 1.1]])

# run several times … why does the md5sum change every time?
print(md5("reg.joblib"))
print(myprediction1, myprediction2)
Upvotes: 1
Views: 377
Reputation: 700
After some research I found an answer to my question. The fact that the joblib file hashes differently on each run has nothing to do with scikit-learn or the trained model. One can demonstrate with joblib.hash(reg) that the MD5 sum of the model object itself is the same every time, i.e. the weights of the trained regression model do not change. This handy function also solves my original "business" problem.
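To illustrate, here is a minimal, self-contained sketch of that approach, reusing the toy data from the question. By default joblib.hash returns an MD5-based digest of the Python object itself, so it does not depend on the exact bytes that joblib.dump happens to write:

import joblib
from sklearn import linear_model

# toy regression data from the question
X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]

reg = linear_model.LinearRegression().fit(X, Y)

# object-level digest: stays the same across runs as long as the
# fitted coefficients do not change
print(joblib.hash(reg))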
The root cause of the non-reproducible MD5 sums of the dumped files lies in the implementation of the underlying pickle serialization that joblib.dump is built on. The decisive hint came from How to hash a large object (dataset) in Python?. Somewhere in the depths of the internet, this old note provides some background:
Since the pickle data format is actually a tiny stack-oriented programming language, and some freedom is taken in the encodings of certain objects, it is possible that the two modules produce different data streams for the same input objects. However it is guaranteed that they will always be able to read each other's data streams.
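As a small sketch to check this yourself, run the script below twice in separate interpreter sessions: the MD5 sum of the dumped file may differ between runs (pickle makes no bit-level guarantee about the data stream), while joblib.hash of the reloaded model should stay the same. Whether the file digest actually changes on your machine can depend on the joblib, pickle, and scikit-learn versions involved.

import hashlib
import joblib
from sklearn import linear_model

X = [[0., 0., 0., 1.], [1., 0., 0., 0.], [2., 2., 0., 1.], [2., 5., 1., 0.]]
Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]

reg = linear_model.LinearRegression().fit(X, Y)
joblib.dump(reg, "reg.joblib")

# digest of the serialized byte stream on disk
with open("reg.joblib", "rb") as f:
    file_digest = hashlib.md5(f.read()).hexdigest()

# digest of the reloaded model object
object_digest = joblib.hash(joblib.load("reg.joblib"))

print("file   :", file_digest)    # may change from run to run
print("object :", object_digest)  # stable identifier for the model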
Upvotes: 2