Dror

Reputation: 13051

Hashing Pandas dataframe breaks

First import:

import pandas as pd
import numpy as np
import hashlib

Next, consider the following:

np.random.seed(42)
arr = np.random.choice([41, 43, 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())

Every run of this snippet prints the same hash for both the array and the DataFrame: ddfee4572d380bef86d3ebe3cb7bfa7c68b7744f55f67f4e1ca5f6872c2c9ba1.

However, if we consider the following:

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())

Note that the data now contains strings. The hash of arr is the same on every run (52db9328682317c44370b8186a5c6bae75f2a94c9d0d5b24d61f602857acd3de), but the hash of the pandas.DataFrame changes each time.

Is there a Pythonic way around this? Or a non-Pythonic one?


Upvotes: 3

Views: 3924

Answers (4)

dspencer

Reputation: 4471

A pandas DataFrame or Series can be hashed using the pandas.util.hash_pandas_object function, starting in version 0.20.1.
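As a minimal sketch of how that function might be used to get one digest for a whole frame: hash_pandas_object returns a Series holding one uint64 hash per row, so the per-row hashes still have to be combined. The helper name frame_digest and the choice to feed the row hashes into SHA-256 are assumptions made for illustration, not part of pandas.

```python
import hashlib

import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: collapse the per-row hashes into one SHA-256 digest."""
    # One uint64 per row; index=True folds the index into each row hash.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    # The result is a plain numeric array, so its bytes are the hash values
    # themselves rather than pointers to Python objects.
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame([["foo", "bar"], ["42", "foo"]])
print(frame_digest(df))
```

Note that the row hashes cover the values and the index but not the column labels, so two frames that differ only in column names would get the same digest under this sketch.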

Upvotes: 2

Bao Wei

Reputation: 1

I wrote a package with hashable subclasses of Series and DataFrame for my needs. Hope this helps.

Upvotes: 0

Dror

Reputation: 13051

A naive workaround is to serialize the whole dataframe to a string and hash that. Either of the following works:

print(hashlib.sha256(df.to_json().encode()).hexdigest())
print(hashlib.sha256(df.to_csv().encode()).hexdigest())

Naturally, this is going to be slow and memory-hungry for big dataframes.

Still, the fact remains that pd.DataFrame(arr).values.tobytes() != arr.tobytes() when the data contains strings, and this is counter-intuitive.

See a summary: https://gist.github.com/drorata/bfc5d956c4fb928dcc77510a33009691

Upvotes: 0

Ankush Khanna

Reputation: 1

I believe that when the cell values are strings, the DataFrame's dtype is object:

df.dtypes

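shows that. That is why you get a different hash each time: an object array stores pointers to Python objects, so tobytes() returns memory addresses rather than the string contents.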
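To make the dtype difference concrete, here is a small sketch. It assumes a frame built from an all-string NumPy array, and the helper name stable_digest is hypothetical. Casting the values back to a fixed-width unicode array before calling tobytes() means the characters themselves are hashed instead of pointers.

```python
import hashlib

import numpy as np
import pandas as pd

arr = np.array([["foo", "bar"], ["42", "foo"]])
df = pd.DataFrame(arr)

print(arr.dtype)        # fixed-width unicode: characters are stored inline
print(df.values.dtype)  # the dtype pandas hands back for string cells


def stable_digest(values) -> str:
    """Hypothetical helper: hash string contents instead of object pointers."""
    # astype(str) copies the data into a fixed-width unicode array whose
    # bytes are the characters, so the result does not depend on addresses.
    fixed = np.asarray(values).astype(str)
    return hashlib.sha256(fixed.tobytes()).hexdigest()


print(stable_digest(df.values))
print(hashlib.sha256(arr.tobytes()).hexdigest())
```

This only covers the values, not the index or column labels, and the fixed width is taken from the longest string in the data, so it is a workaround under those assumptions rather than a general-purpose DataFrame hash.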

Upvotes: 0
