W0bble

Reputation: 767

Edit Content of a file inside zip file

I have lots of zip archives with text files in them. I need to find and modify a specific piece of text in those files. So far I have managed to find all the relevant lines using:

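import os
import zipfile
from glob import glob

# Collect every .zip file below the root directory.
files = []
pattern = "*.zip"
for dirpath, _, _ in os.walk(r'X:\zips'):
    files.extend(glob(os.path.join(dirpath, pattern)))

# Scan the archives only after the walk has finished, so each one is read once.
for path in files:
    with zipfile.ZipFile(path, "r") as root:
        for name in root.namelist():
            # read() returns bytes; decoding as UTF-8 is an assumption about the text files
            text = root.read(name).decode("utf-8", errors="replace")
            for line in text.split("\n"):
                if "keyword" in line:
                    print(line)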

I know how to replace the keyword within a line. But how can I save the change "in place", without extracting all the other text files to disk, deleting the old zip, and creating a new one?

Upvotes: 1

Views: 5037

Answers (1)

synthesizerpatel

Reputation: 28036

You can't do this without some low-level monkey business that the zipfile module probably does not support out of the box. However, it is possible.

First a quick explanation of the ZIP file structure:

From PKWare's ZIP file structure document:

  [local file header 1]
  [encryption header 1]
  [file data 1]
  [data descriptor 1]
  . 
  .
  .
  [local file header n]
  [encryption header n]
  [file data n]
  [data descriptor n]
  [archive decryption header] 
  [archive extra data record] 
  [central directory header 1]
  .
  .
  .
  [central directory header n]
  [zip64 end of central directory record]
  [zip64 end of central directory locator] 
  [end of central directory record]

The local file header looks like:

  local file header signature     4 bytes  (0x04034b50)
  version needed to extract       2 bytes
  general purpose bit flag        2 bytes
  compression method              2 bytes
  last mod file time              2 bytes
  last mod file date              2 bytes
  crc-32                          4 bytes
  compressed size                 4 bytes
  uncompressed size               4 bytes
  file name length                2 bytes
  extra field length              2 bytes

  file name (variable size)
  extra field (variable size)
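
As a minimal sketch of how that layout maps onto bytes, the fixed 30-byte part of the header can be decoded with the struct module. The helper name read_local_header is made up for this illustration, and it assumes an unencrypted entry with an ASCII or UTF-8 file name:

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header, little-endian,
# in the same field order as the listing above.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")

def read_local_header(raw, offset=0):
    """Decode the local file header starting at `offset` in `raw` (bytes)."""
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(raw, offset)
    if sig != 0x04034B50:
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    name = raw[name_start:name_start + name_len]
    return {
        "name": name.decode("utf-8"),  # assumption: ASCII/UTF-8 file names
        "method": method,
        "crc": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        # the file data begins right after the name and the extra field
        "data_offset": name_start + name_len + extra_len,
    }

# Build a tiny uncompressed archive in memory to try it on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", "hello keyword")

header = read_local_header(buf.getvalue())
print(header)
```

The data_offset value is the piece that matters for in-place edits: it is where the member's bytes start inside the archive.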

The central directory structure looks like:

    central file header signature   4 bytes  (0x02014b50)
    version made by                 2 bytes
    version needed to extract       2 bytes
    general purpose bit flag        2 bytes
    compression method              2 bytes
    last mod file time              2 bytes
    last mod file date              2 bytes
    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes
    file name length                2 bytes
    extra field length              2 bytes
    file comment length             2 bytes
    disk number start               2 bytes
    internal file attributes        2 bytes
    external file attributes        4 bytes
    relative offset of local header 4 bytes

    file name (variable size)
    extra field (variable size)
    file comment (variable size)

Each file has a CRC and size in its own local header, and the central directory holds a CRC and size for it as well. So if you modify a single file, its size will most likely change (depending on what you actually do to it), and its CRC will almost certainly change as well.

That means every file after it would have to be shifted within the archive, which changes the overall archive size and the local header offsets recorded in the central directory.
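
To see those fields without parsing bytes by hand, the zipfile module exposes them on ZipInfo objects. This is only an illustrative sketch; the file name notes.txt and its content are made up:

```python
import io
import zipfile
import zlib

text = b"a line with keyword in it\n"

# Write one compressed member into an in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", text)

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("notes.txt")

# The stored CRC is the CRC-32 of the uncompressed bytes.
print(info.CRC == zlib.crc32(text))
# Uncompressed size, compressed size, and where the local header sits.
print(info.file_size, info.compress_size, info.header_offset)

# Even a same-length edit changes the CRC.
edited = text.replace(b"keyword", b"Keyword")
crc_changed = zlib.crc32(edited) != info.CRC
print(crc_changed)
```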

You can work around this by NOT compressing that specific file: the CRC will change, but the overall archive size will not, as long as the replacement is exactly as long as the text it replaces, so that the edit stays inside the boundaries of that single file.

At the very least, however, you will need to:

  1. Update the CRC in the file's local header
  2. Update the CRC in the file's central directory entry
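
A rough sketch of those two updates, under several assumptions: the member is stored uncompressed and unencrypted, it has no data descriptor, the archive is not ZIP64 and has no data prepended to it, and the replacement has exactly the same byte length. The function name patch_stored_member is invented for this example; zipfile is used only to look up the member, and the patching itself is plain seek and write:

```python
import os
import struct
import tempfile
import zipfile
import zlib

def patch_stored_member(zip_path, member, old, new):
    """Replace bytes `old` with `new` (equal length) in a stored member, in place."""
    if len(old) != len(new):
        raise ValueError("replacement must not change the length")
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.getinfo(member)
    # 0x01 = encrypted, 0x08 = CRC kept in a trailing data descriptor
    if info.compress_type != zipfile.ZIP_STORED or info.flag_bits & 0x09:
        raise ValueError("only stored, unencrypted members without a data descriptor")

    with open(zip_path, "r+b") as f:
        # Local header: CRC at +14, name/extra lengths at +26, 30 fixed bytes.
        f.seek(info.header_offset + 26)
        name_len, extra_len = struct.unpack("<HH", f.read(4))
        data_offset = info.header_offset + 30 + name_len + extra_len
        f.seek(data_offset)
        data = f.read(info.file_size).replace(old, new)
        crc = struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)
        f.seek(data_offset)
        f.write(data)                      # same length, so nothing moves
        f.seek(info.header_offset + 14)
        f.write(crc)                       # step 1: local header CRC

        # Find the end of central directory record in the tail of the file.
        f.seek(0, os.SEEK_END)
        f.seek(max(0, f.tell() - 65557))
        tail = f.read()
        eocd = tail.rfind(b"PK\x05\x06")
        count, = struct.unpack_from("<H", tail, eocd + 10)
        pos, = struct.unpack_from("<I", tail, eocd + 16)

        # Walk the central directory until the entry for this member turns up.
        for _ in range(count):
            f.seek(pos + 28)
            n, e, c = struct.unpack("<HHH", f.read(6))
            f.seek(pos + 42)
            local_offset, = struct.unpack("<I", f.read(4))
            if local_offset == info.header_offset:
                f.seek(pos + 16)
                f.write(crc)               # step 2: central directory CRC
                return
            pos += 46 + n + e + c
    raise LookupError("central directory entry not found")

# Demo on a throwaway archive.
path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", "first keyword\n")
    zf.writestr("b.txt", "untouched\n")

patch_stored_member(path, "a.txt", b"keyword", b"KEYWORD")

with zipfile.ZipFile(path) as zf:
    bad_member = zf.testzip()      # None when every CRC checks out
    patched = zf.read("a.txt")
    other = zf.read("b.txt")
```

Try this on a copy of the archive first: the sketch writes the data before it has located the central directory entry, so a failure halfway leaves the archive with a mismatched CRC.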

It's worth noting that having the central directory at the end of the file is kind of a neat feature, since it means you can generate 'dynamic' zip files on the fly. I did this a while ago for a company that was selling MP3s online. I made a 'dynamic' zip packager that essentially concatenated MP3 files together with the right ZIP headers, so that you could add a bunch of songs to a 'download list'. It streamed the MP3s from their homes on disk directly to the client, injecting the right header information and finally the central directory record. On the web server side it was just a series of reads and writes, but to the client it looked like a 'real' zip file.
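
The general idea can be sketched as a generator that emits a local header, the raw data, and finally the central directory. This is not the packager described above, just a minimal illustration; stream_stored_zip is a made-up name, and it assumes ASCII file names, uncompressed entries, no ZIP64, and data small enough that its CRC can be computed before the header is sent:

```python
import io
import struct
import zipfile
import zlib

def stream_stored_zip(members):
    """Yield the bytes of a ZIP archive holding (name, data) pairs uncompressed."""
    central = []
    offset = 0
    for name, data in members:
        raw_name = name.encode("ascii")  # assumption: ASCII names only
        crc = zlib.crc32(data) & 0xFFFFFFFF
        # Fields shared by both headers: version 20, no flags, method 0 (stored),
        # time 0, date 33 (1980-01-01), CRC, sizes, name length, no extra field.
        common = struct.pack("<HHHHHIIIHH", 20, 0, 0, 0, 33,
                             crc, len(data), len(data), len(raw_name), 0)
        local = struct.pack("<I", 0x04034B50) + common + raw_name
        yield local
        yield data
        # Central directory entry: remembers where the local header started.
        central.append(struct.pack("<IH", 0x02014B50, 20) + common
                       + struct.pack("<HHHII", 0, 0, 0, 0, offset) + raw_name)
        offset += len(local) + len(data)

    directory = b"".join(central)
    yield directory
    # End of central directory record: entry counts, directory size and offset.
    yield struct.pack("<IHHHHIIH", 0x06054B50, 0, 0,
                      len(central), len(central), len(directory), offset, 0)

# Join the chunks here; a web server would write each chunk to the response.
archive = b"".join(stream_stored_zip([("one.txt", b"first"), ("two.txt", b"second")]))

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()
    bad_member = zf.testzip()
    second = zf.read("two.txt")
```

Because the central directory is only written at the very end, the generator never needs to seek backwards, which is what makes streaming possible.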

Upvotes: 6
