Ank

Reputation: 1904

Pandas read_csv throws ValueError while reading gzip file

I am trying to read a gzip file using pandas.read_csv like so:

import pandas as pd
df = pd.read_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)

But it throws this error:

ValueError: Passed header names mismatches usecols

However, if I manually extract the zip file from the gz file, read_csv is able to read the data without errors:

df = pd.read_csv("data.ZIP", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)

Since I have to read a lot of these files, I don't want to extract them manually. How can I fix this error?

Upvotes: 0

Views: 125

Answers (3)

furas

Reputation: 143197

You have two levels of compression, gzip and zip, but pandas knows how to handle only one level. After it removes the gzip layer it tries to parse the zip archive itself as CSV text, which is presumably why the columns no longer match usecols.

You can use the gzip and zipfile modules with io.BytesIO to extract the data to a file-like object in memory.

Here is minimal working code. It can be useful if the zip has many files and you want to select which one to extract:

import pandas as pd
import gzip
import zipfile
import io

with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()

file_like_object_1 = io.BytesIO(data)

with zipfile.ZipFile(file_like_object_1) as f2:
    #print([x.filename for x in f2.filelist])  # list all filenames
    #data = f2.read('data.csv')                # extract selected filename
    #data = f2.read(f2.filelist[0])            # extract first file
    data = f2.read(f2.filelist[0].filename)    # extract first file

file_like_object_2 = io.BytesIO(data)

df = pd.read_csv(file_like_object_2)

print(df)

But if the zip has only one file, read_csv can extract it itself. You need to add the option compression='zip' because a file-like object has no filename, so read_csv can't use the filename's extension to recognize the compressed file.

import pandas as pd
import gzip
import io

with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()

file_like_object_1 = io.BytesIO(data)

df = pd.read_csv(file_like_object_1, compression='zip')

print(df)
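
Since the question mentions many such files, the two steps above can be wrapped in one function. This is only a sketch: the name read_gzipped_zip_csv and its member parameter are made up for illustration, and it assumes each archive fits in memory. Extra keyword arguments are passed straight to read_csv.

```python
import gzip
import io
import zipfile

import pandas as pd


def read_gzipped_zip_csv(path, member=None, **read_csv_kwargs):
    """Read one CSV from a zip archive that is itself gzip-compressed.

    Hypothetical helper: `member` selects a file inside the zip and
    defaults to the first entry in the archive.
    """
    # First layer: gzip. The decompressed bytes are the zip archive.
    with gzip.open(path, "rb") as gz:
        zip_bytes = gz.read()

    # Second layer: zip. ZipFile needs a seekable object, hence BytesIO.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        name = member if member is not None else archive.namelist()[0]
        with archive.open(name) as csv_file:
            return pd.read_csv(csv_file, **read_csv_kwargs)
```

With the options from the question it would be called as read_gzipped_zip_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True), and looped over a list of paths to collect one DataFrame per file.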

Upvotes: 1

Only god knows

Reputation: 305

You can use the zipfile module, for example:

import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

Upvotes: 0

Mouad Slimane

Reputation: 1065

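Use the gzip module to decompress all your files, something like this:

import gzip

# list_file_names is assumed to hold the paths of your .gz files
for file in list_file_names:
    file_name = file[:-3]  # strip the trailing ".gz"
    with gzip.open(file, 'rb') as f:
        file_content = f.read()
        # write the decompressed bytes (the inner zip) next to the original
        with open(file_name, "wb") as r:
            r.write(file_content)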
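
If the files are large, the same idea can be written so that it copies in chunks instead of reading each file into memory at once. This is a sketch only; gunzip_file is a hypothetical name, and it assumes every path ends in ".gz".

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src):
    """Decompress 'name.zip.gz' next to itself as 'name.zip' and return the new path."""
    src = Path(src)
    dst = src.with_suffix("")  # drops only the final ".gz"
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams the data in chunks
    return dst
```

The returned path points at the inner zip file, which the question reports read_csv can already open directly.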

Upvotes: 1
