Reputation: 11
I have multiple large CSV files (~1 GB each) that are gzip-compressed. My problem is that they are encoded in ISO-8859-1, and I would like them to be in UTF-8.
Obviously I could just decompress each file, convert it to UTF-8, and compress it again, but this seems quite memory-inefficient to me.
Is there a clean and efficient way to do this in place and avoid temporarily storing large files?
Upvotes: 0
Views: 687
Reputation: 112374
You mentioned two different concerns, being "memory-inefficient" and "temporarily storing large files", as if they were one question. They aren't.
You certainly do not need to, and should not, load the entire file into memory. You can use Python's GzipFile class (most conveniently via gzip.open) to read small chunks of the file and write small chunks back out, so there is no memory problem.
In doing that, you would need to retain the input file in mass storage until the output file is complete, at which point you can delete the input file. While you can avoid having an intermediate uncompressed file in mass storage, you will at least need, temporarily, enough free mass storage for a second copy of the file.
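A minimal sketch of that streaming approach, assuming Python 3's gzip.open (a convenience wrapper around GzipFile); the function name recode_gz and the file names are hypothetical:

```python
import gzip
import os
import shutil

def recode_gz(src_path, dst_path):
    # gzip.open in text mode decompresses and decodes ISO-8859-1 on read,
    # then encodes UTF-8 and recompresses on write. copyfileobj streams in
    # fixed-size chunks, so memory use stays small regardless of file size.
    # newline="" preserves the original line endings in the CSV data.
    with gzip.open(src_path, "rt", encoding="iso-8859-1", newline="") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        shutil.copyfileobj(src, dst)
    # Delete the input only after the converted copy is complete, so peak
    # disk usage is roughly two compressed copies of the file.
    os.remove(src_path)

recode_gz("data.csv.gz", "data.utf8.csv.gz")
```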
Upvotes: 1