wheeeee

Reputation: 1525

Recovering data from a corrupted, possibly partial zip

I'm working with some old legacy code and getting some build errors. I have a zip file called vocab100k.zip, and the code says that it should unzip to include 2 files: vocab.100k.utf8 and vectors.100k.utf8.

When I try to run System.IO.Compression.ZipFile.OpenRead(zipFileFullPath), I get System.IO.InvalidDataException: 'End of Central Directory record could not be found.' When I try to manually unzip through the File Explorer using WinRAR, I get "Unexpected end of archive".

Double-clicking to preview the contents shows me that one of my two files is present inside.

I used WinRAR's repair function, but attempting to extract the repaired zip loads to about 90% before it throws the following errors.

[Screenshot: WinRAR extraction errors]

I suspect that this may have been one part of a multi-part zip at some point, and the later parts have been lost. Is there any way to extract even part of the vectors.100k.utf8 that I see there? Are there other ways the zip could have been corrupted?

Upvotes: 0

Views: 2933

Answers (2)

Jiří Kuneš

Reputation: 51

If you have access to Linux, you can try using the zip tool to create a fixed version of the archive:

zip -FF vocab100k.zip --out vocab100k_fixed.zip

But this works only if the file you want to extract is not missing any parts.

Upvotes: 2

pmqs

Reputation: 3735

Recovering Data from a Truncated Zip File

Assuming the file is simply truncated in the middle of vectors.100k.utf8 and the corruption isn't more serious, you should be able to recover part of the data. The output you've shown does suggest that this is a truncation issue, but I can't know for sure without the zipdetails output I requested.

If this is just a truncation issue, you may be able to uncompress what is present with the Perl script, recoverzip, below. This should work on Windows, macOS or Linux; the only prerequisite is that you have Perl installed.

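use strict ;
use warnings ;

# IO::Uncompress::Unzip reads a member starting from its local header,
# so it does not depend on the central directory at the end of the zip.
use IO::Uncompress::Unzip qw( unzip $UnzipError );

die "Usage: recoverzip zipfile member outfile\n"
    if @ARGV != 3;

my $filename = shift;   # zip file to read
my $name = shift;       # member to extract
my $outfile = shift;    # where to write the recovered data

# On a truncated member this dies with an error, but data uncompressed
# before the truncation point may already have been written to $outfile.
unzip $filename  => $outfile,
           Name  => $name,
    or die "Cannot uncompress '$filename': $UnzipError\n" ;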

The script takes three parameters:

  • the name of the zip file to process
  • the name of the zip member to read
  • the output filename to store the recovered data

This script isn't guaranteed to get any data from a truncated zip file, but it can in some cases. It depends on where the truncation is.
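
For readers who do not have Perl available, the same idea can be sketched in Python: locate the member's local file header, skip to its payload, and inflate as much of the raw deflate stream as is actually present. This is only a minimal sketch under stated assumptions, not a port of recoverzip. The names recover_member and _inflate_partial are hypothetical, only the "stored" and "deflated" methods are handled, zip64 and encryption are ignored, and the naive signature scan could in principle be misled if the bytes PK\x03\x04 happen to occur inside compressed data.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"   # local file header signature
LOCAL_FIXED = 30            # size of the fixed part of a local file header


def _inflate_partial(payload: bytes, chunk: int = 4096) -> bytes:
    """Inflate a raw deflate stream, keeping whatever decodes before it ends."""
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # negative wbits = raw deflate
    out = bytearray()
    for i in range(0, len(payload), chunk):
        try:
            out += inflater.decompress(payload[i:i + chunk])
        except zlib.error:
            break   # corrupt data: keep what was decoded so far
        if inflater.eof:
            break   # reached the genuine end of the member
    return bytes(out)


def recover_member(zip_bytes: bytes, member: str) -> bytes:
    """Return as much of `member` as can be read from a possibly truncated zip."""
    wanted = member.encode("utf-8")  # assumption: member names are ASCII/UTF-8
    pos = zip_bytes.find(LOCAL_SIG)
    while pos != -1 and pos + LOCAL_FIXED <= len(zip_bytes):
        (_sig, _ver, _flags, method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = struct.unpack_from(
            "<4sHHHHHIIIHH", zip_bytes, pos)
        name_start = pos + LOCAL_FIXED
        name = zip_bytes[name_start:name_start + name_len]
        if name == wanted:
            payload = zip_bytes[name_start + name_len + extra_len:]
            if method == 0:   # stored; assumes the header records the size
                return payload[:csize]
            if method == 8:   # deflated
                return _inflate_partial(payload)
            raise ValueError(f"unsupported compression method {method}")
        pos = zip_bytes.find(LOCAL_SIG, pos + 4)
    raise KeyError(member)
```

A call might look like recover_member(open("vocab100k.zip", "rb").read(), "vectors.100k.utf8"), with the result written to a file opened in binary mode. Reading the whole archive into memory keeps the sketch short; a very large archive would need a streaming variant.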

Create a truncated zip file

Here is a worked example to show how it works. Note that I'm using Linux tools to generate the truncated zip file. The recovery part is not dependent on Linux; all you need is to have Perl installed on your system.

First pick an input file to add to a zip file

$ cat lorem.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor
in reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.

Add lorem.txt to a zip file called try.zip

$ zip try.zip lorem.txt
$ unzip -l try.zip 
Archive:  try.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      446  2022-09-09 09:17   lorem.txt
---------                     -------
      446                     1 file

Now we need to truncate try.zip in the middle of the lorem.txt member. To do that, we need to know where the compressed data lives in the zip file. We can use zipdetails to get that information.

$ perl zipdetails try.zip

0000 LOCAL HEADER #1       04034B50
0004 Extract Zip Spec      14 '2.0'
0005 Extract OS            00 'MS-DOS'
0006 General Purpose Flag  0000
     [Bits 1-2]            0 'Normal Compression'
0008 Compression Method    0008 'Deflated'
000A Last Mod Time         55294A2E 'Fri Sep  9 10:17:28 2022'
000E CRC                   F90EE7FF
0012 Compressed Length     0000010E
0016 Uncompressed Length   000001BE
001A Filename Length       0009
001C Extra Length          001C
001E Filename              'lorem.txt'
0027 Extra ID #0001        5455 'UT: Extended Timestamp'
0029   Length              0009
002B   Flags               '03 mod access'
002C   Mod Time            631AF698 'Fri Sep  9 09:17:28 2022'
0030   Access Time         631AF698 'Fri Sep  9 09:17:28 2022'
0034 Extra ID #0002        7875 'ux: Unix Extra Type 3'
0036   Length              000B
0038   Version             01
0039   UID Size            04
003A   UID                 000003E8
003E   GID Size            04
003F   GID                 000003E8
0043 PAYLOAD

0151 CENTRAL HEADER #1     02014B50
0155 Created Zip Spec      1E '3.0'
0156 Created OS            03 'Unix'
0157 Extract Zip Spec      14 '2.0'
0158 Extract OS            00 'MS-DOS'
0159 General Purpose Flag  0000
     [Bits 1-2]            0 'Normal Compression'
015B Compression Method    0008 'Deflated'
015D Last Mod Time         55294A2E 'Fri Sep  9 10:17:28 2022'
0161 CRC                   F90EE7FF
0165 Compressed Length     0000010E
0169 Uncompressed Length   000001BE
016D Filename Length       0009
016F Extra Length          0018
0171 Comment Length        0000
0173 Disk Start            0000
0175 Int File Attributes   0001
     [Bit 0]               1 Text Data
0177 Ext File Attributes   81ED0000
017B Local Header Offset   00000000
017F Filename              'lorem.txt'
0188 Extra ID #0001        5455 'UT: Extended Timestamp'
018A   Length              0005
018C   Flags               '03 mod access'
018D   Mod Time            631AF698 'Fri Sep  9 09:17:28 2022'
0191 Extra ID #0002        7875 'ux: Unix Extra Type 3'
0193   Length              000B
0195   Version             01
0196   UID Size            04
0197   UID                 000003E8
019B   GID Size            04
019C   GID                 000003E8

01A0 END CENTRAL HEADER    06054B50
01A4 Number of this disk   0000
01A6 Central Dir Disk no   0000
01A8 Entries in this disk  0001
01AA Total Entries         0001
01AC Size of Central Dir   0000004F
01B0 Offset to Central Dir 00000151
01B4 Comment Length        0000
Done

There is quite a lot of output from zipdetails, but for our purposes we only need to look at the PAYLOAD line, which shows the offset where the compressed data for lorem.txt starts. In this case it is 0x43. The next record is the CENTRAL HEADER at offset 0x151, so the compressed payload starts at offset 0x43 and ends at 0x150.
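
The same offsets can be cross-checked from the local header fields in the zipdetails listing. The fixed part of a local file header is 30 bytes, followed by the filename and the extra field, and then the payload. The short Python sketch below redoes that arithmetic; the function name payload_range is hypothetical, and the numbers are the ones from the listing above.

```python
LOCAL_HEADER_FIXED_SIZE = 30  # bytes in a local file header before the filename


def payload_range(header_offset, filename_length, extra_length, compressed_length):
    """Return (first, last) byte offsets of a member's compressed payload."""
    first = header_offset + LOCAL_HEADER_FIXED_SIZE + filename_length + extra_length
    last = first + compressed_length - 1
    return first, last


# Values from the zipdetails output for lorem.txt:
# header at 0x0000, Filename Length 0x0009, Extra Length 0x001C,
# Compressed Length 0x0000010E.
first, last = payload_range(0x0000, 0x0009, 0x001C, 0x0000010E)
print(hex(first), hex(last))  # prints: 0x43 0x150
```

By this arithmetic, cutting the file at offset 0x100 would keep 0x100 - 0x43 = 189 of the 270 compressed bytes.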

Now truncate the zip file in the middle of the lorem.txt compressed data, at offset 0x100, and write the truncated zip file to trunc.zip:

$ head -c $((0x100)) try.zip >trunc.zip
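
On a system without head (for example, plain Windows), the truncation step could be reproduced with a few lines of Python. This is just a rough equivalent of the head -c command above; the function name truncate_copy is hypothetical.

```python
def truncate_copy(src_path, dst_path, size):
    """Copy the first `size` bytes of src_path to dst_path; return bytes written."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        return dst.write(src.read(size))


# Hypothetical usage, mirroring the shell command:
# truncate_copy("try.zip", "trunc.zip", 0x100)
```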

We now have a sample truncated zip file to test. First, check what unzip thinks of the truncated file. It reports an error very similar to yours:

$ unzip -t trunc.zip 
Archive:  trunc.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of trunc.zip or
        trunc.zip.zip, and cannot find trunc.zip.ZIP, period.
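
Both this message and the .NET exception in the question are about the End of Central Directory record, which sits at the very end of a complete zip file. A quick way to confirm that a file is damaged in this particular way is to look for that record's signature near the end of the file and to count the local headers that did survive. The Python sketch below is a rough diagnostic only; the function names are hypothetical, and a plain byte search can in principle give false matches inside compressed data.

```python
EOCD_SIG = b"PK\x05\x06"    # End of Central Directory signature
LOCAL_SIG = b"PK\x03\x04"   # local file header signature
EOCD_MIN_SIZE = 22          # fixed size of the record without its comment
MAX_COMMENT = 65535         # the comment length field is 16 bits


def has_end_of_central_directory(data: bytes) -> bool:
    """True if an EOCD signature appears where a valid record could start."""
    tail = data[-(EOCD_MIN_SIZE + MAX_COMMENT):]
    return tail.rfind(EOCD_SIG) != -1


def count_local_headers(data: bytes) -> int:
    """Count local header signatures, a rough guess at surviving members."""
    return data.count(LOCAL_SIG)
```

If has_end_of_central_directory returns False while count_local_headers is non-zero, the file probably lost its tail, which is consistent with truncation rather than damage at the start of the archive.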

Recover data from the truncated zip file

Now run the recoverzip script to see if we can get any data from the zip file.

$ perl recoverzip trunc.zip lorem.txt recovered.txt
Cannot uncompress 'trunc.zip': unexpected end of file

The unexpected end of file error is to be expected in this use-case.

Finally, let's see what data was recovered

$ cat recovered.txt 
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor
in reprehenderit in voluptate velit e

Success! In this instance we have recovered some of the data from lorem.txt.

Upvotes: 2
