Reputation: 13051
First, the imports:
import pandas as pd
import numpy as np
import hashlib
Next, consider the following:
np.random.seed(42)
arr = np.random.choice([41, 43, 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Multiple executions of this snippet always print the same hash from both print statements, and it is identical across runs: ddfee4572d380bef86d3ebe3cb7bfa7c68b7744f55f67f4e1ca5f6872c2c9ba1.
However, if we consider the following:
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Note that there are strings in the data now. The hash of arr is fixed (52db9328682317c44370b8186a5c6bae75f2a94c9d0d5b24d61f602857acd3de) across evaluations, but the hash of the pandas.DataFrame changes on every run.
Is there a Pythonic way around this? Or even a non-Pythonic one?
Upvotes: 3
Views: 3924
Reputation: 4471
A pandas DataFrame or Series can be hashed using the pandas.util.hash_pandas_object function, available starting in version 0.20.1.
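For example (a minimal sketch, assuming pandas >= 0.20.1; the data mirrors the string example from the question), the per-row uint64 hashes it returns can be folded into a single digest with hashlib:
import hashlib
import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object

np.random.seed(42)
df = pd.DataFrame(np.random.choice(['foo', 'bar', 42], size=(3, 3)))

row_hashes = hash_pandas_object(df, index=True)   # one uint64 per row, content-based
digest = hashlib.sha256(row_hashes.values.tobytes()).hexdigest()
print(digest)  # stable across runs, even with object (string) columns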
Upvotes: 2
Reputation: 1
I wrote a package with hashable subclasses of Series and DataFrame for my needs. Hope this helps.
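As a rough illustration of the idea (the package itself is not named here; HashableDataFrame is a hypothetical name, not its API), such a subclass could delegate to pandas.util.hash_pandas_object:
import pandas as pd
from pandas.util import hash_pandas_object

class HashableDataFrame(pd.DataFrame):
    # Hypothetical sketch of a DataFrame subclass that can be hashed.
    @property
    def _constructor(self):
        # Keep pandas operations returning HashableDataFrame instances
        return HashableDataFrame

    def __hash__(self):
        # Fold the per-row uint64 hashes into a single Python int.
        # Note: DataFrame equality stays elementwise, so this does not
        # follow the usual hash/eq contract; it is only a sketch.
        return int(hash_pandas_object(self, index=True).sum())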
Upvotes: 0
Reputation: 13051
A naive workaround is to get a string representation of the whole dataframe and hash it. In particular, either of the following can work:
print(hashlib.sha256(df.to_json().encode()).hexdigest())
print(hashlib.sha256(df.to_csv().encode()).hexdigest())
Naturally, this is going to be very slow for big dataframes.
Still, the fact remains that pd.DataFrame(arr).values is not the same array as arr, and this is counter-intuitive.
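A quick check of the dtypes (continuing the string example from the question) shows where the mismatch comes from:
print(arr.dtype)        # fixed-width unicode, e.g. <U3
print(df.values.dtype)  # object: pandas stores the strings as Python objects
print((df.values == arr).all())              # the element values match...
print(df.values.tobytes() == arr.tobytes())  # ...but the raw bytes do not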
See a summary: https://gist.github.com/drorata/bfc5d956c4fb928dcc77510a33009691
Upvotes: 0
Reputation: 1
In my view, this happens because you are using strings as cell values. The DataFrame's dtype then becomes object, as df.dtypes shows, and that is why you get a different hash each time: hashing the raw bytes of an object array serializes the underlying object pointers, which differ between runs.
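For instance (continuing from the question's string example), one illustrative way to confirm this and to sidestep it is to convert back to a fixed-width string array before hashing; this assumes every cell should be compared by its string form:
import hashlib

print(df.dtypes)  # every column reported as "object"

# Converting back to fixed-width unicode removes the object pointers from
# the buffer, so the digest is stable across runs:
stable = df.values.astype(str)
print(hashlib.sha256(stable.tobytes()).hexdigest())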
Upvotes: 0