Reputation: 536
I am having trouble finding the compression options available to me. At the bottom of the to_csv documentation page there is an example that shows 2 options:
compression_opts = dict(method='zip', archive_name='out.csv')
But I see no listing of all the options available, and I can't find one elsewhere. I'd love to see the full list (assuming there are more than these 2).
My end goal right now: the zip operation puts the file in a zip archive, but all the folders are included as well, so the file ends up buried in a bunch of folders inside the zip. I'm sure there is an easy option to prevent the folders from being added to the zip...
Upvotes: 2
Views: 2012
Reputation: 1369
I think I understand your question. Let's say I have a dataframe, and I want to save it locally in a zipfile. And let's say I want that zipfile saved at the location somepath/myfile.zip
Let's say I run this program (also assuming that somepath/ is a valid folder in the current working directory):
### with_path.py
import pandas as pd
filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])
compression_options = {"method": "zip"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)
If I list the contents of the resulting file, I can see that the path where I wanted to store the zip file was ALSO used as the name of the file INSIDE the zip, including the folder structure, and it is even still named .zip, which is weird:
(.venv) pandas_test % unzip -l somepath/myfile.zip
Archive: somepath/myfile.zip
Length Date Time Name
--------- ---------- ----- ----
17 09-17-2021 12:56 somepath/myfile.zip
--------- -------
17 1 file
Instead, I can supply an archive_name as a compression option to explicitly provide a name for my file inside the zip. Like so:
### without_path.py
import pandas as pd
filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])
compression_options = {"method": "zip", "archive_name": f"{filename}.csv"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)
Now, although our resulting zip file was still written at the desired location of somepath/, the file inside the zip does NOT include the path as part of the filename, and is correctly named with a .csv extension.
(.venv) pandas_test % unzip -l somepath/myfile.zip
Archive: somepath/myfile.zip
Length Date Time Name
--------- ---------- ----- ----
17 09-17-2021 12:59 myfile.csv
--------- -------
17 1 file
The strange default behavior doesn't seem to be called out in the documentation, but you can see the use of the archive_name parameter in the final example of the pandas.DataFrame.to_csv documentation. IMHO they should throw an error and force you to provide an archive_name value, because I can't imagine when you would want to name the file inside a zip the exact same as the zip file itself.
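Since the inner file name has to be supplied by hand, one way to avoid repeating it is to derive it from the zip path. The following is only a sketch: zip_compression_options is a hypothetical helper (not part of pandas) that drops the folders and swaps the extension for .csv.

```python
from pathlib import PurePath


def zip_compression_options(zip_path):
    """Build a to_csv compression dict whose archive_name has no folders.

    Hypothetical helper: takes the zip file's path, drops the directories,
    and swaps the extension for .csv.
    """
    inner_name = PurePath(zip_path).with_suffix(".csv").name
    return {"method": "zip", "archive_name": inner_name}


options = zip_compression_options("somepath/myfile.zip")
print(options)  # {'method': 'zip', 'archive_name': 'myfile.csv'}
```

The returned dict would then be passed as the compression argument, e.g. df.to_csv("somepath/myfile.zip", compression=options).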
Upvotes: 3
Reputation: 2803
After some research into to_csv's integrated compression mechanisms, I would suggest a different approach for your problem:
Assuming that you have a number of DataFrames that you want to write to your zip file as individual csv files (for this example I keep the DataFrames in a list, so I can loop over them):
import pandas as pd

df_list = []
df_list.append(pd.DataFrame({'col1': [1, 2, 3],
                             'col2': ['A', 'B', 'C']}))
df_list.append(pd.DataFrame({'col1': [4, 5, 6],
                             'col2': ['a', 'b', 'c']}))
Then you can convert each of these DataFrames to a csv string in memory (not a file) and write that string to your zip archive as a file (e.g. df0.csv, df1.csv, ...):
import zipfile

with zipfile.ZipFile(file="out.zip", mode="w") as zf:
    for index, df in enumerate(df_list):
        # to_csv without a path returns the csv content as a string
        csv_str = df.to_csv(index=False)
        zf.writestr("{}{}.csv".format("df", index), csv_str)
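To check that entries written this way sit at the top level of the archive, the archive can be read back with the standard library. This is a small sketch that uses hard-coded CSV strings as stand-ins for the output of df.to_csv(index=False) and an in-memory buffer instead of out.zip; both are assumptions made to keep the example self-contained.

```python
import io
import zipfile

# Stand-ins for the strings that df.to_csv(index=False) would return.
csv_strings = ["col1,col2\n1,A\n2,B\n3,C\n", "col1,col2\n4,a\n5,b\n6,c\n"]

buffer = io.BytesIO()  # in-memory archive, so nothing is written to disk
with zipfile.ZipFile(buffer, mode="w") as zf:
    for index, csv_str in enumerate(csv_strings):
        zf.writestr(f"df{index}.csv", csv_str)

# Re-open the archive for reading and list the entry names.
with zipfile.ZipFile(buffer) as zf:
    entry_names = zf.namelist()

print(entry_names)  # ['df0.csv', 'df1.csv']
```

Because the name passed to writestr is used as the entry name verbatim, no folders appear unless the name itself contains a path.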
EDIT:
Here is what I think pandas does with the compression options (you can look at the code on GitHub or in your local filesystem among the Python libraries):
When the save function in pandas/io/formats/csvs.py is called, it will use get_handle from pandas/io/common.py with the compression options as a parameter. There, method is expected as the first entry. For zip, it will use a class named _BytesZipFile (derived from zipfile.ZipFile) with the handle (file path or buffer), mode and archive_name, which explains the example from the pandas documentation. Other parameters in **kwargs will just be passed through to the __init__ function of the super class (except for compression, which is set to ZIP_DEFLATED).
So it seems that you can pass allowZip64, compresslevel (Python 3.7 and above), and strict_timestamps (Python 3.8 and above), as documented for zipfile.ZipFile. I could verify this at least for allowZip64 with Python 3.6.
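The pass-through described above can be imitated with the standard library alone. The sketch below is not pandas' code: write_csv_to_zip is a hypothetical stand-in that forwards extra keyword arguments to zipfile.ZipFile and defaults compression to ZIP_DEFLATED, which is the behavior this answer attributes to _BytesZipFile. It assumes Python 3.7 or later because of compresslevel.

```python
import io
import zipfile


def write_csv_to_zip(target, archive_name, csv_text, **zip_kwargs):
    """Write csv_text as a single entry; extra kwargs go to ZipFile."""
    # Default to deflate unless the caller asked for something else.
    zip_kwargs.setdefault("compression", zipfile.ZIP_DEFLATED)
    with zipfile.ZipFile(target, mode="w", **zip_kwargs) as zf:
        zf.writestr(archive_name, csv_text)


buffer = io.BytesIO()
write_csv_to_zip(
    buffer, "out.csv", "0,1\na,1\nb,2\n", allowZip64=True, compresslevel=9
)

# Read the archive back to inspect the entry that was written.
with zipfile.ZipFile(buffer) as zf:
    info = zf.getinfo("out.csv")

print(info.filename, info.compress_type == zipfile.ZIP_DEFLATED)
```

If the reading of the pandas source above is right, the same keys could be added to the compression dict given to to_csv, next to method and archive_name.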
I do not see a way to use something like the -j / --junk-paths option in the zipfile library.
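There is no such switch, but when files already exist on disk, ZipFile.write accepts an arcname argument, so a similar effect can be approximated by hand. A minimal sketch, where zip_without_folders is a hypothetical helper and the nested folder layout is made up for the demonstration:

```python
import os
import tempfile
import zipfile


def zip_without_folders(file_paths, zip_path):
    """Add each file under its base name only, roughly like `zip -j`."""
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            zf.write(path, arcname=os.path.basename(path))


with tempfile.TemporaryDirectory() as tmp:
    # Create a csv file a few folders deep (made-up layout).
    nested_dir = os.path.join(tmp, "some", "nested", "folder")
    os.makedirs(nested_dir)
    csv_path = os.path.join(nested_dir, "out.csv")
    with open(csv_path, "w") as handle:
        handle.write("col1,col2\n1,A\n")

    zip_path = os.path.join(tmp, "out.zip")
    zip_without_folders([csv_path], zip_path)

    with zipfile.ZipFile(zip_path) as zf:
        flat_names = zf.namelist()

print(flat_names)  # ['out.csv']
```

One caveat: two files with the same base name in different folders would end up as duplicate entries, so this only suits inputs whose file names are unique.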
Upvotes: 1