Saaru Lindestøkke

Reputation: 2564

How can I save a pandas dataframe that is backwards compatible with older pandas versions?

Context

I work with python 3.9.6 and pandas 1.3.0.
My colleague works with python 3.6.12 and pandas 1.1.5.
I want to create a dataframe and share it with my colleague, without asking them to update their environment (that request would incur some hassle).

Question

How can I write out a dataframe to a file using my newer python/pandas versions in a way that their older python/pandas versions can read it in as a dataframe?

What I've tried or looked into

Default .to_pickle() method
If in the newer python environment I write:

df.to_pickle(r"C:\somepath\file.bz2")

and in the older python environment I try:

pd.read_pickle(r"C:\somepath\file.bz2")

I get:

ValueError: unsupported pickle protocol: 5

Specifying a protocol version in the .to_pickle() method
Fine, I thought, I'll specify a different protocol.

df.to_pickle(r"C:\somepath\file.bz2", protocol=3)

However, if in the older python environment I try to load it I get

AttributeError: module 'pandas.core.internals.blocks' has no attribute 'new_block'

This error remains for all protocol versions from 0 to 5.

Previous question on protocol version
I found this question, which only has the answer that the pandas versions must match.
I find it hard to believe that's the only solution; if it were, what would be the point of having multiple pickle protocols that are meant to be backwards compatible?

Previous question on the new_block attribute
This question mentions the same error with the missing new_block attribute. Again, the answer is to update the pandas version (over which I have no control at the moment).

Downgrading the newer python/pandas versions
I could downgrade my newer python/pandas to match my colleague's versions.
Haven't tried it yet, but I assume that should work. However, that would really be a last resort, as then I would need a special "low version" environment to work with this one colleague.

Exporting to CSV
This works, but it loses some dataframe-specific features like data types and NaN values, so I don't consider this a valid workaround.
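For example (column names made up), a quick round trip shows the kind of loss I mean: datetime and categorical columns come back as plain object columns.

import io
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2021-07-01", "2021-07-02"]),
    "cat": pd.Series(["a", "b"], dtype="category"),
})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
roundtrip = pd.read_csv(buf)

print(df.dtypes)         # ts: datetime64[ns], cat: category
print(roundtrip.dtypes)  # both columns come back as object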

Pickling separately
I thought maybe the issue lies in the pandas .to_pickle() and .read_pickle() methods, so I tried using the pickle library directly to write the file (using protocol 3):

import pickle

with open('file.pkl', 'wb') as f:
    pickle.dump(df, f, 3)

... and then read it in the older python environment:

import pickle

with open('file.pkl', 'rb') as f:
    df = pickle.load(f)

Unfortunately, I am still met with

AttributeError: module 'pandas.core.internals.blocks' has no attribute 'new_block'

Converting to a dict, then pickling that
Per the suggestion in the comments I tried:

ddf = df.to_dict()

with open('file.pkl', 'wb') as f:
    pickle.dump(ddf, f, 3)

But then, when I try to read it in the older environment, I get:

AttributeError: Can't get attribute '_unpickle_timestamp' on <module 'pandas._libs.tslibs.timestamps

My DataFrame has a timestamp column in it, which apparently cannot be unpickled by the older pandas version.

Upvotes: 5

Views: 2194

Answers (1)

iacob

Reputation: 24261

Why the above don't work

what's the point of having multiple pickle protocols which are meant to be backward compatible?

These protocols govern how different versions of Python itself read pickled files. They do not convert objects built with a recent version of a specific library into a form that an older version of that library can reconstruct.

it's only a minor semantic versioning difference and that should yield "functionality in a backwards compatible manner".

You're misunderstanding this: it means that code written against the earlier version will still function as expected under the newer version. It does not mean that new functions introduced in the recent version will work in the older one (or that objects created using them can be loaded there).
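One way to see why the protocol number doesn't help here: whatever protocol is chosen, the pickle stream still records references to pandas-internal constructors by module path, and it is those references that the older pandas cannot resolve. A rough sketch (the exact internals referenced depend on the pandas version doing the pickling, 1.3.0 in the question):

import pickle
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

for protocol in range(0, 6):
    data = pickle.dumps(df, protocol=protocol)
    # On pandas >= 1.3, every protocol embeds the same reference that
    # pandas 1.1.5 then fails to resolve when unpickling.
    print(protocol, b"pandas.core.internals.blocks" in data, b"new_block" in data)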

Resolving version conflicts with a venv

If this is a shared project with multiple collaborators, your development should happen in a virtual environment where you can load the exact requirements of the project (both the python version and the library versions), so as not to run into conflicts with your global python install.

Both you and your colleague can then work from within your venvs with full confidence that you are using compatible libraries and functionality.

It is very straightforward to set up: effectively you just create a new folder with its own python install, and any libraries installed from within the venv are stored there. This local version of python only sees those libraries. This is what a requirements.txt file for a project is for - it defines the libraries and version requirements of the project.

When you are done with it you can easily delete the folder.


Steps:

  1. Create a virtual environment named e.g. /my/env:
    python -m venv /my/env --upgrade-deps
    
  2. Activate your venv (the specific command depends on your OS).
  3. Install your project dependencies to the venv:
    pip install -r requirements.txt
    

You can easily create a requirements.txt file like so:

pip install pipreqs

pipreqs /path/to/project
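For this particular situation the generated file could then be pinned to the versions both environments have to support; something along these lines (the exact pin is just an illustration based on the versions in the question):

pandas==1.1.5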

Manual solution

You could manually change the datatype of the timestamp column to one the earlier version of pandas can reconstruct (e.g. plain text) before pickling it.
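A minimal sketch of that idea, assuming the column is literally named "timestamp" (the real column name isn't given in the question) and that the remaining columns only hold basic types:

Newer environment:

import pickle

df["timestamp"] = df["timestamp"].astype(str)   # datetimes become plain ISO strings
with open("file.pkl", "wb") as f:
    pickle.dump(df.to_dict(), f, protocol=3)

Older environment:

import pickle
import pandas as pd

with open("file.pkl", "rb") as f:
    df = pd.DataFrame(pickle.load(f))
df["timestamp"] = pd.to_datetime(df["timestamp"])   # restore the datetime dtype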

Upvotes: 1
