MysticForce

Reputation: 1311

Invalid gz file after splitting

I have a 500 MB gz file, which I split as follows:

split -b 100m "file.gz" "file1.gz.part-"

After splitting, I get the following files:

file1.gz.part-aa
file1.gz.part-ab
file1.gz.part-ac
file1.gz.part-ad
file1.gz.part-ae

I am trying to iterate over the objects in each part using Python's gzip module, as follows:

import gzip
with gzip.open(filename) as f:
    for line in f:
        pass  # per-line processing goes here

This works for file1.gz.part-aa, but for the other 4 parts I get this error:

Not a gzipped file

Upvotes: 1

Views: 295

Answers (2)

pid

Reputation: 11607

You can split before you gzip:

split -l 300000 "file.txt" "tweets1.part-"
      ^ every 300000 lines

Notice that the input of split is NOT a *.gz file but the original line-oriented file.

Then gzip every part separately:

gzip tweets1.part-*

This also removes the uncompressed parts, leaving only the .gz files (gzip's -k/--keep option keeps the originals).

In Python, you can now read each part separately.
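
A minimal sketch of reading the separately gzipped parts, assuming they are named tweets1.part-aa.gz, tweets1.part-ab.gz and so on (the names gzip would produce from the commands above) and contain UTF-8 text. The helper name iter_lines and the encoding are assumptions for illustration, not part of the original answer.

```python
import glob
import gzip


def iter_lines(pattern):
    """Yield text lines from every gzip file matching pattern, in sorted order."""
    for path in sorted(glob.glob(pattern)):
        # Each part was compressed on its own, so it is a complete gzip file
        # with its own header and gzip.open can read it independently.
        with gzip.open(path, "rt", encoding="utf-8") as f:  # UTF-8 is assumed
            for line in f:
                yield line.rstrip("\n")
```

It could be used as: for line in iter_lines("tweets1.part-*.gz"). Sorting the paths keeps the lines in their original order, because split assigns the suffixes aa, ab, ac, ... alphabetically.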

Upvotes: 1

Ignacio Vazquez-Abrams

Reputation: 798696

A gzip file has a header that identifies it as a gzip file. After splitting, only the first file will have this header. Rejoin the files before processing.
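
One way to rejoin the parts from Python could look like the sketch below. It assumes the parts are named as in the question (file1.gz.part-aa, file1.gz.part-ab, ...) and that the output path does not itself match the pattern. The function name rejoin is hypothetical.

```python
import glob
import shutil


def rejoin(pattern, out_path):
    """Concatenate the split parts, in sorted order, into a single file."""
    with open(out_path, "wb") as out:
        # split names its parts aa, ab, ac, ..., so sorting the paths
        # restores the original byte order.
        for path in sorted(glob.glob(pattern)):
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)

# Hypothetical usage:
#   rejoin("file1.gz.part-*", "rejoined.gz")
#   then read rejoined.gz with gzip.open as in the question
```

The parts are copied as raw bytes without decompressing them, since only the rejoined file is a valid gzip stream.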

Upvotes: 1
