da Bich

Reputation: 536

Accessing zip compression options in pandas to_csv

I am having trouble finding the compression options available to me. At the bottom of the to_csv documentation page there is an example that shows two options:

compression_opts = dict(method='zip', archive_name='out.csv')

But I see no listing of all the available options, and I can't find one elsewhere. I'd love to see the full list (assuming there are more than these two).

End goal currently: the zip operation puts the file into a zip archive, but the folders from the path are added to the archive as well, so the file ends up buried in a bunch of folders inside the zip. I'm sure there is an easy option to prevent the folders from being added to the zip.

Upvotes: 2

Views: 2012

Answers (2)

sql_knievel

Reputation: 1369

I think I understand your question. Let's say I have a dataframe that I want to save locally in a zip file, and I want that zip file saved at the location somepath/myfile.zip.

Let's say I run this program (also assuming that somepath/ is a valid folder in the current working directory):

### with_path.py

import pandas as pd

filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])

compression_options = {"method": "zip"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)

If I list the contents of the resulting file, I can see that the path where I wanted to store the zip file was ALSO used as the name of the file INSIDE the zip, including the folder structure, and it even kept the .zip extension, which is weird:

(.venv) pandas_test % unzip -l somepath/myfile.zip
     
Archive:  somepath/myfile.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       17  09-17-2021 12:56   somepath/myfile.zip
---------                     -------
       17                     1 file

Instead, I can supply an archive_name as a compression option to explicitly provide a name for my file inside the zip. Like so:

### without_path.py

import pandas as pd

filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])

compression_options = {"method": "zip", "archive_name": f"{filename}.csv"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)

Now, although the resulting zip file was still written to the desired location somepath/, the file inside the zip does NOT include the path as part of its name, and it is correctly named with a .csv extension:

(.venv) pandas_test % unzip -l somepath/myfile.zip

Archive:  somepath/myfile.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       17  09-17-2021 12:59   myfile.csv
---------                     -------
       17                     1 file
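
If unzip is not at hand, the same check can be done from Python. The following is only a sketch: the helper name write_zipped_csv and the temporary directory are assumptions made for this illustration, not part of the programs above. It writes the frame with an explicit archive_name and then lists the member names with the standard-library zipfile module.

```python
import os
import tempfile
import zipfile

import pandas as pd


def write_zipped_csv(df, zip_path, member_name):
    """Write df as one CSV member named member_name inside zip_path.

    Hypothetical helper for illustration only.
    """
    options = {"method": "zip", "archive_name": member_name}
    df.to_csv(zip_path, compression=options)
    # Return the names stored inside the archive so they can be inspected.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


if __name__ == "__main__":
    df = pd.DataFrame([["a", 1], ["b", 2]])
    with tempfile.TemporaryDirectory() as tmp:
        target_dir = os.path.join(tmp, "somepath")
        os.makedirs(target_dir)
        zip_path = os.path.join(target_dir, "myfile.zip")
        print(write_zipped_csv(df, zip_path, "myfile.csv"))
```

The returned list should contain only the name given as archive_name, regardless of the folder the zip file itself was written to.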

The strange default behavior doesn't seem to be called out in the documentation, but you can see the use of the archive_name parameter in the final example of the pandas.DataFrame.to_csv documentation. IMHO they should throw an error and force you to provide an archive_name value, because I can't imagine when you would want to name the file inside a zip the exact same as the zip file itself.

Upvotes: 3

Gerd

Reputation: 2803

After some research into to_csv's integrated compression mechanisms, I would suggest a different approach to your problem:

Assuming that you have a number of DataFrames that you want to write to your zip file as individual csv files (for this example I keep the DataFrames in a list, so I can loop over them):

import pandas as pd

df_list = []
df_list.append(pd.DataFrame({'col1':[1, 2, 3],
                             'col2':['A', 'B', 'C']}))
df_list.append(pd.DataFrame({'col1':[4, 5, 6],
                             'col2':['a', 'b', 'c']}))

Then you can convert each of these DataFrames to a csv string in memory (not a file) and write that string to your zip archive as a file (e.g. df0.csv, df1.csv, ...):

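import zipfile

with zipfile.ZipFile(file="out.zip", mode="w") as zf:
    for index, df in enumerate(df_list):
        # to_csv without a path returns the CSV text as a string
        csv_str = df.to_csv(index=False)
        # the first argument is the member name inside the archive
        zf.writestr("{}{}.csv".format("df", index), csv_str)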

EDIT:

Here is what I think pandas does with the compression options (you can look at the code on GitHub or in your local filesystem among the Python libraries):

When the save function in pandas/io/formats/csvs.py is called, it will use get_handle from pandas/io/common.py with the compression options as a parameter. There, method is expected as the first entry. For zip, it will use a class named _BytesZipFile (derived from zipfile.ZipFile) with the handle (file path or buffer), mode and archive_name, which explains the example from the pandas documentation. Other parameters in **kwargs will just be passed through to the __init__ function of the super class (except for compression, which is set to ZIP_DEFLATED).
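
To make that mechanism concrete, here is a simplified stand-in written only with the standard library. It is not pandas' actual source: the class name SketchZipWriter, the method write_text, and the fallback to the zip file's own path are assumptions modelled on the description above and on the listing in the other answer.

```python
import os
import tempfile
import zipfile


class SketchZipWriter(zipfile.ZipFile):
    """Hypothetical, simplified stand-in for pandas' internal zip wrapper."""

    def __init__(self, file, mode="w", archive_name=None, **kwargs):
        # Use deflate unless the caller explicitly asks for something else.
        kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
        self.archive_name = archive_name
        # Remaining keyword arguments go straight to zipfile.ZipFile.
        super().__init__(file, mode, **kwargs)

    def write_text(self, data):
        # Without an explicit archive_name, fall back to the path of the zip
        # file itself, which is how folders can end up inside the archive.
        name = self.archive_name or self.filename
        self.writestr(name, data)


def demo_member_names(base_dir):
    """Write the same data with and without archive_name; return both names."""
    folder = os.path.join(base_dir, "somepath")
    os.makedirs(folder, exist_ok=True)
    zip_path = os.path.join(folder, "myfile.zip")
    results = []
    for archive_name in (None, "myfile.csv"):
        with SketchZipWriter(zip_path, "w", archive_name=archive_name) as zf:
            zf.write_text("0,1\na,1\nb,2\n")
        with zipfile.ZipFile(zip_path) as zf:
            results.append(zf.namelist()[0])
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_member_names(tmp))
```

In this sketch the first name carries the folder part of the path and the second is just myfile.csv, which mirrors the two unzip listings in the other answer.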

So it seems that you can pass allowZip64, compresslevel (Python 3.7 and above), and strict_timestamps (Python 3.8 and above), as documented for zipfile.ZipFile. I could verify this at least for allowZip64 with Python 3.6.
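
A minimal sketch of passing such extra keys, assuming a pandas version that forwards them to zipfile.ZipFile as described above. The helper name write_with_zip_options is made up for this example, and compresslevel needs Python 3.7 or later.

```python
import os
import tempfile
import zipfile

import pandas as pd


def write_with_zip_options(df, zip_path, member_name, **zip_kwargs):
    """Write df to zip_path, forwarding extra keys to the zip writer.

    Hypothetical helper; zip_kwargs are assumed to be valid
    zipfile.ZipFile constructor arguments such as compresslevel.
    """
    options = {"method": "zip", "archive_name": member_name, **zip_kwargs}
    df.to_csv(zip_path, index=False, compression=options)
    # Read the member back to confirm the archive is valid.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member_name).decode("utf-8")


if __name__ == "__main__":
    df = pd.DataFrame({"col1": [1, 2], "col2": ["A", "B"]})
    with tempfile.TemporaryDirectory() as tmp:
        text = write_with_zip_options(
            df,
            os.path.join(tmp, "out.zip"),
            "out.csv",
            compresslevel=9,
            allowZip64=True,
        )
        print(text)
```

A key that zipfile.ZipFile does not accept would be expected to raise a TypeError, which is a quick way to probe which options a given Python version supports.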

I do not see a constructor option in the zipfile library that works like the -j / --junk-paths option, so nothing of that kind can be passed through the compression options.
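
There is no archive-wide switch, but when you build the archive yourself, the arcname argument of ZipFile.write sets the name stored inside the archive, so passing only the base name has an effect similar to -j. A minimal sketch, assuming the CSV files already exist on disk; the helper name zip_without_folders is made up for this example.

```python
import os
import tempfile
import zipfile


def zip_without_folders(file_paths, zip_path):
    """Add each file to zip_path under its base name only (like zip -j).

    Hypothetical helper; files that share a base name would collide.
    """
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            # arcname is the member name; basename drops the folder part.
            zf.write(path, arcname=os.path.basename(path))
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nested = os.path.join(tmp, "a", "b")
        os.makedirs(nested)
        csv_path = os.path.join(nested, "df0.csv")
        with open(csv_path, "w") as fh:
            fh.write("col1,col2\n1,A\n")
        print(zip_without_folders([csv_path], os.path.join(tmp, "out.zip")))
```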

Upvotes: 1
