Reputation: 1904
I am trying to read a gzip file using pandas.read_csv like so:
import pandas as pd
df = pd.read_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)
But it throws this error:
ValueError: Passed header names mismatches usecols
However, if I manually extract the zip file from the gz file, then read_csv is able to read the data without errors:
df = pd.read_csv("data.ZIP", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)
Since I have to read a lot of these files I don't want to manually extract them. So, how can I fix this error?
Upvotes: 0
Views: 125
Reputation: 143197
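You have two levels of compression, gzip and zip, but pandas knows how to work with only one level of compression. With a name ending in .gz it removes only the gzip layer and then tries to parse the raw zip bytes as CSV, which is most likely why usecols doesn't match.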
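If you want to confirm that a file really has both layers before deciding how to read it, you can look at the magic bytes. This is only a sketch: describe_layers is a made-up helper name, and it relies on the standard signatures (a gzip stream starts with the bytes 1f 8b, a non-empty zip archive starts with PK\x03\x04).

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"     # first two bytes of a gzip stream
ZIP_MAGIC = b"PK\x03\x04"    # local file header of a non-empty zip archive


def describe_layers(path):
    """Return the compression layers detected in `path`, outermost first.

    Hypothetical helper: it only checks for gzip on the outside and zip
    on the inside, which is the situation described in the question.
    """
    layers = []
    with open(path, "rb") as raw:
        head = raw.read(4)
    if head.startswith(GZIP_MAGIC):
        layers.append("gzip")
        # Decompress just the first bytes of the payload to inspect it.
        with gzip.open(path, "rb") as inner:
            head = inner.read(4)
    if head.startswith(ZIP_MAGIC):
        layers.append("zip")
    return layers
```

For a file like the one in the question this should report both gzip and zip, which tells you read_csv alone cannot handle it.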
You can use the gzip and zipfile modules with io.BytesIO to extract it to a file-like object in memory.
Here is minimal working code. It can be useful if the zip has many files and you want to select which one to extract.
import pandas as pd
import gzip
import zipfile
import io
with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()  # remove the outer gzip layer
file_like_object_1 = io.BytesIO(data)
with zipfile.ZipFile(file_like_object_1) as f2:
    #print([x.filename for x in f2.filelist]) # list all filenames
    #data = f2.read('data.csv') # extract selected filename
    #data = f2.read(f2.filelist[0]) # extract first file
    data = f2.read(f2.filelist[0].filename) # extract first file
file_like_object_2 = io.BytesIO(data)
df = pd.read_csv(file_like_object_2)
print(df)
But if the zip has only one file then you can let read_csv extract it. You need to add the option compression='zip' because a file-like object has no filename, so read_csv can't use the filename's extension to recognize the compressed file.
import pandas as pd
import gzip
import io
with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()  # remove the outer gzip layer
file_like_object_1 = io.BytesIO(data)
df = pd.read_csv(file_like_object_1, compression='zip')
print(df)
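Since the question mentions many such files, the two steps can be wrapped in a small function. This is a rough sketch rather than anything pandas provides: open_inner_member and its member parameter are names I made up, and it assumes each archive fits in memory.

```python
import gzip
import io
import zipfile


def open_inner_member(gz_path, member=None):
    """Return one file from a gzip-wrapped zip archive as an in-memory buffer.

    `member` is the name of the file inside the zip (hypothetical parameter).
    When it is omitted, the archive must contain exactly one file.
    """
    with gzip.open(gz_path, "rb") as gz:
        zip_bytes = gz.read()  # strip the outer gzip layer

    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Ignore directory entries, keep only real files.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        if member is None:
            if len(names) != 1:
                raise ValueError(
                    f"expected one file in {gz_path!r}, found {names}"
                )
            member = names[0]
        return io.BytesIO(archive.read(member))
```

The returned buffer is plain CSV bytes, so inside a loop over your file names it can go straight into pd.read_csv(open_inner_member(name), usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True) with no compression option.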
Upvotes: 1
Reputation: 305
You can use the zipfile module, for example:
import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
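The snippet above expects a plain zip file on disk. For the gzip-wrapped files in the question, one possible variant is to remove the gzip layer in memory first and then call extractall. This is a sketch under that assumption; extract_zip_gz is a hypothetical name, not part of zipfile.

```python
import gzip
import io
import zipfile


def extract_zip_gz(gz_path, target_dir):
    """Extract every member of a gzip-wrapped zip archive into `target_dir`.

    Returns the list of member names stored in the archive.
    """
    with gzip.open(gz_path, "rb") as gz:
        # Outer gzip layer removed; the zip archive is held in memory.
        zip_buffer = io.BytesIO(gz.read())
    with zipfile.ZipFile(zip_buffer) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

This avoids the manual extraction step, but it still writes the CSV files to disk, so it is mainly useful if you want to keep the extracted files.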
Upvotes: 0
Reputation: 1065
Use the gzip module to decompress all your files, something like this:
import gzip

# list_file_names: your list of paths to the .gz files
for file in list_file_names:
    file_name = file.replace(".gz", "")  # output name without .gz
    with gzip.open(file, 'rb') as f:
        file_content = f.read()
    with open(file_name, "wb") as r:
        r.write(file_content)
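The loop above reads each file completely into memory before writing it. If the files are large, a streaming copy with shutil.copyfileobj may be a better fit. A minimal sketch, where gunzip_file is a made-up helper name:

```python
import gzip
import shutil


def gunzip_file(src, dst=None):
    """Decompress `src` to `dst` in chunks.

    When `dst` is omitted it defaults to `src` without its trailing '.gz'.
    """
    if dst is None:
        if not src.endswith(".gz"):
            raise ValueError(f"{src!r} does not end with '.gz'; pass dst")
        dst = src[: -len(".gz")]  # drop only the trailing suffix
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Copies block by block instead of loading everything at once.
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Slicing off only the trailing suffix also avoids touching a ".gz" that happens to appear elsewhere in the path.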
Upvotes: 1