Reputation: 767
I have lots of zip archives with text files in them. I need to find and modify specific text in those files. So far I have managed to find all the relevant lines using:
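import os
import zipfile
from glob import glob

files = []
pattern = "*.zip"
# Collect every .zip file under the root directory, recursively.
for dirpath, _, _ in os.walk(r'X:\zips'):
    files.extend(glob(os.path.join(dirpath, pattern)))

for file in files:
    with zipfile.ZipFile(file, "r") as root:
        for name in root.namelist():
            # read() returns bytes; decode before searching (assumes UTF-8 text).
            text = root.read(name).decode("utf-8", errors="replace")
            for line in text.splitlines():
                if "keyword" in line:
                    print(line)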
I know that I can replace the keyword within the line. But how can I save the change "in place", without writing all the other text files to the hard disk, deleting the old zip and creating a new one?
Upvotes: 1
Views: 5037
Reputation: 28036
You can't do this without some low-level monkey business that the zipfile module probably doesn't support out of the box. It is possible, though.
First a quick explanation of the ZIP file structure:
From PKWare's ZIP file structure document
[local file header 1]
[encryption header 1]
[file data 1]
[data descriptor 1]
.
.
.
[local file header n]
[encryption header n]
[file data n]
[data descriptor n]
[archive decryption header]
[archive extra data record]
[central directory header 1]
.
.
.
[central directory header n]
[zip64 end of central directory record]
[zip64 end of central directory locator]
[end of central directory record]
The file header looks like:
local file header signature 4 bytes (0x04034b50)
version needed to extract 2 bytes
general purpose bit flag 2 bytes
compression method 2 bytes
last mod file time 2 bytes
last mod file date 2 bytes
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
file name length 2 bytes
extra field length 2 bytes
file name (variable size)
extra field (variable size)
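To make the layout concrete, here is a minimal sketch that decodes the fixed 30-byte part of a local file header with `struct`. The helper name `read_local_header` and the demo archive are made up for illustration. It assumes a plain archive without ZIP64 and without a data descriptor (when general purpose bit 3 is set, the CRC and sizes in the local header are zero).

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte part of a local file header (little-endian), in the field order listed above.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")

def read_local_header(data, offset=0):
    """Decode the local file header that starts at `offset` in `data` (bytes)."""
    (signature, version, flags, method, mod_time, mod_date,
     crc32, compressed_size, uncompressed_size,
     name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if signature != 0x04034B50:
        raise ValueError("no local file header at offset %d" % offset)
    name_start = offset + LOCAL_HEADER.size
    return {
        "name": data[name_start:name_start + name_len],
        "method": method,                      # 0 = stored, 8 = deflated
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
        # The file data begins right after the variable-length name and extra field.
        "data_offset": name_start + name_len + extra_len,
    }

# Demo: build a tiny uncompressed archive in memory and decode its first header.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("hello.txt", b"hello world")

header = read_local_header(buf.getvalue())
print(header["name"], header["compressed_size"],
      header["crc32"] == zlib.crc32(b"hello world"))
```

The `data_offset` value is what you would seek to if you wanted to touch the raw bytes of that entry.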
The central directory structure looks like:
central file header signature 4 bytes (0x02014b50)
version made by 2 bytes
version needed to extract 2 bytes
general purpose bit flag 2 bytes
compression method 2 bytes
last mod file time 2 bytes
last mod file date 2 bytes
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
file name length 2 bytes
extra field length 2 bytes
file comment length 2 bytes
disk number start 2 bytes
internal file attributes 2 bytes
external file attributes 4 bytes
relative offset of local header 4 bytes
file name (variable size)
extra field (variable size)
file comment (variable size)
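As a rough illustration of how the central directory points back at each local header, the sketch below locates the end of central directory record and walks the entries. The function name `list_central_directory` is hypothetical, and the code assumes a simple archive: no ZIP64 records, a single disk, and no archive comment that happens to contain the end-record signature.

```python
import io
import struct
import zipfile

EOCD = struct.Struct("<IHHHHIIH")       # end of central directory record, 22 bytes
CENTRAL = struct.Struct("<I6H3I5H2I")   # fixed part of a central file header, 46 bytes

def list_central_directory(data):
    """Return name, CRC, sizes and local header offset for every central directory entry."""
    eocd_pos = data.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("end of central directory record not found")
    _, _, _, _, total, cd_size, cd_offset, _ = EOCD.unpack_from(data, eocd_pos)

    entries = []
    pos = cd_offset
    for _ in range(total):
        fields = CENTRAL.unpack_from(data, pos)
        if fields[0] != 0x02014B50:
            raise ValueError("bad central file header at offset %d" % pos)
        crc32, compressed_size, uncompressed_size = fields[7:10]
        name_len, extra_len, comment_len = fields[10:13]
        name_start = pos + CENTRAL.size
        entries.append({
            "name": data[name_start:name_start + name_len],
            "crc32": crc32,
            "compressed_size": compressed_size,
            "uncompressed_size": uncompressed_size,
            "local_header_offset": fields[16],
        })
        # Skip the variable-length name, extra field and comment to reach the next entry.
        pos = name_start + name_len + extra_len + comment_len
    return entries

# Demo: compare against what the zipfile module itself recorded.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"bravo!")
    infos = zf.infolist()

entries = list_central_directory(buf.getvalue())
for entry in entries:
    print(entry["name"], entry["local_header_offset"], entry["crc32"])
```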
Each file has a CRC and sizes in its local file header, and the same CRC and sizes are repeated in its central directory entry. So if you modify a single file, then depending on what you actually do to it, its size will most likely change, and its CRC will change as well 99% of the time.
That means every file after it would have to be shifted within the archive, changing the overall archive size.
You can work around this by NOT compressing that specific file: the CRC will change, but the overall archive size will not (as long as you stay inside the boundaries of that single file, that is, the new contents are exactly as long as the old ones).
You will, however, at the very least need to update the CRC in both places: the file's local header and its central directory entry.
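Here is a hedged sketch of that workaround, not something the zipfile module offers directly. The helper `patch_stored_member` and the demo file names are invented for illustration. It only handles the easy case described above: the member is stored uncompressed, is not encrypted, has no data descriptor, the archive is not ZIP64 and has nothing prepended to it, and the replacement is exactly as long as the text it replaces. It overwrites the data, then the CRC in the local header and in the central directory entry. Try it on a copy first.

```python
import os
import struct
import tempfile
import zipfile
import zlib

def patch_stored_member(path, member, old, new):
    """Replace `old` with same-length `new` in a STORED member, editing the archive in place."""
    if len(old) != len(new):
        raise ValueError("replacement must keep the member's size unchanged")
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo(member)
        # Flag bit 0 = encrypted, bit 3 = data descriptor; neither is handled here.
        if info.compress_type != zipfile.ZIP_STORED or info.flag_bits & 0x09:
            raise ValueError("only plain stored members without a data descriptor are handled")
        patched = zf.read(member).replace(old, new)
    crc = struct.pack("<I", zlib.crc32(patched) & 0xFFFFFFFF)

    with open(path, "r+b") as f:
        # Locate the end of central directory record in the tail of the file.
        size = f.seek(0, os.SEEK_END)
        tail_len = min(size, 22 + 65535)
        f.seek(size - tail_len)
        tail = f.read(tail_len)
        eocd = tail.rfind(b"PK\x05\x06")
        if eocd < 0:
            raise ValueError("end of central directory record not found")
        total, cd_size, cd_offset = struct.unpack_from("<HII", tail, eocd + 10)

        # Walk the central directory to find the entry pointing at our local header.
        f.seek(cd_offset)
        cd = f.read(cd_size)
        pos = 0
        for _ in range(total):
            name_len, extra_len, comment_len = struct.unpack_from("<HHH", cd, pos + 28)
            (local_offset,) = struct.unpack_from("<I", cd, pos + 42)
            if local_offset == info.header_offset:
                break
            pos += 46 + name_len + extra_len + comment_len
        else:
            raise ValueError("central directory entry not found")

        # Local header: name and extra lengths sit at byte 26, the CRC at byte 14.
        f.seek(info.header_offset + 26)
        name_len, extra_len = struct.unpack("<HH", f.read(4))
        f.seek(info.header_offset + 30 + name_len + extra_len)
        f.write(patched)                 # same length, so nothing after it moves
        f.seek(info.header_offset + 14)
        f.write(crc)                     # CRC in the local file header
        f.seek(cd_offset + pos + 16)
        f.write(crc)                     # CRC in the central directory entry

# Demo on a throwaway archive.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(demo_path, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", "first keyword line\n")
    zf.writestr("b.txt", "untouched\n")
size_before = os.path.getsize(demo_path)

patch_stored_member(demo_path, "a.txt", b"keyword", b"KEYWORD")
with zipfile.ZipFile(demo_path) as zf:
    print(zf.testzip(), zf.read("a.txt"))
```

`testzip()` re-reads every member and checks its CRC, so it is a cheap way to confirm that the patched archive is still consistent.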
It's worth noting that the central directory being at the end of the file is kind of a neat feature, since it means you can generate 'dynamic' zip files on the fly. I did this a while ago for a company that was selling MP3s online: I made a 'dynamic' zip packager that would essentially concatenate MP3 files together with the right ZIP headers, so that you could add a bunch of songs to a 'download list'. It would stream the MP3s from their homes on disk directly to the client, injecting the right header information and finally the central directory record. On the web server side it was just a series of reads and writes, but on the client it looked like a 'real' zip file.
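This is not the packager described above, just a small sketch of the same idea using Python 3's zipfile module, which can write to a stream that supports neither `seek()` nor `tell()`. The `ChunkSink` class and `stream_zip` function are hypothetical names. In a real server the `write()` method would push each chunk to the client instead of collecting it in a list.

```python
import io
import zipfile

class ChunkSink:
    """Hypothetical write-only sink with no seek() or tell(), like a network socket."""
    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

    def flush(self):
        pass

def stream_zip(members):
    """Write (name, payload) pairs as an uncompressed zip to a forward-only sink."""
    sink = ChunkSink()
    with zipfile.ZipFile(sink, "w", zipfile.ZIP_STORED) as zf:
        for name, payload in members:
            # Each entry goes out as header plus data; nothing is rewritten later.
            zf.writestr(name, payload)
    # Closing the ZipFile appends the central directory and the end record.
    return b"".join(sink.chunks)

archive = stream_zip([("a.mp3", b"first song"), ("b.mp3", b"second song")])

# The client side sees an ordinary zip file.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    print(zf.namelist())
```

Because the sink cannot seek back, zipfile cannot patch the CRC into each local header afterwards. It records the values in a data descriptor after the file data instead, and the central directory at the end carries the final figures.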
Upvotes: 6