Peque

Reputation: 14801

Verifying file integrity with Python

I have a directory with many big files. They have all been created with this line of code:

pickle.dump(variable, gzip.open(file_name, 'wb'), -1)

So they are basically compressed, serialized variables.

At some point in the past, one or more crashes or interruptions may have occurred while executing that exact line, but I do not know whether that actually happened.

So, first, I am assuming that if something unexpected happened, there may be a corrupted file_name in the file system that does not contain the complete compressed, serialized variable. Am I right here?

Now I wonder if there is a way to check the integrity of those files without having to load them into memory one by one. I am trying to avoid executing pickle.load(gzip.open(file_name, 'rb')) inside a try/except.

Is this possible? Is there another (faster) way to check if pickle and gzip both finished successfully?

Upvotes: 1

Views: 2170

Answers (3)

xu.wang

Reputation: 1

I use the following method in Python 2.6. In Python 2.7 you can use a with statement instead of try/finally.

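import gzip

# Python 2 code. The function name is illustrative.
# Only the gzip header is checked; a truncated file can still pass.
def has_valid_gzip_header(filepath):
    f = None
    try:
        f = gzip.open(filepath, 'rb')
        f._read_gzip_header()  # private GzipFile method in Python 2
        return True
    except Exception, e:
        print e
        return False
    finally:
        # f stays None if gzip.open itself failed
        if f is not None:
            f.close()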
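
The header check above relies on a private GzipFile method. Here is a minimal sketch of a similar check that uses only public calls, assuming it is enough to compare the first two bytes of the file with the gzip magic number (1f 8b). The name has_gzip_magic is hypothetical. Like the snippet above, it says nothing about whether the rest of the file is complete.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def has_gzip_magic(filepath):
    """Hypothetical helper: True if the file starts with the gzip magic number."""
    with open(filepath, "rb") as f:
        # Reads two bytes only; the compressed payload is never touched.
        return f.read(2) == GZIP_MAGIC
```

A file that was cut off halfway through writing still passes this test, so it can only rule out files that are empty or not gzip at all.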

Upvotes: 0

Peque

Reputation: 14801

Thanks to @pppery's answer, I found a solution that is faster than deserializing everything into memory.

import gzip
import os
with gzip.open(file_name, 'rb') as f:
    f.seek(-1, os.SEEK_END)  # decompresses the stream to reach its last byte
    ends_with_stop = f.read(1) == b'.'

Note that:

  • The seek call can raise an exception if the compressed file is malformed or truncated (use try/except).
  • The read call reads the last byte, which should be the . character (the pickle STOP opcode).
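
The same idea can be wrapped into a reusable check. This is a minimal sketch, not the exact code from this answer: it reads the stream in chunks instead of calling seek, and it assumes that a bad header, a truncated stream, or a checksum mismatch surfaces as OSError, EOFError, or zlib.error. The name looks_intact and the chunk_size parameter are hypothetical.

```python
import gzip
import zlib

PICKLE_STOP = b"."  # STOP opcode that terminates a pickle stream


def looks_intact(file_name, chunk_size=1024 * 1024):
    """Hypothetical helper: True if the gzip stream decompresses fully
    and its last decompressed byte is the pickle STOP opcode.

    The data is streamed chunk by chunk and never unpickled, so memory
    use stays bounded by chunk_size.
    """
    last_byte = b""
    try:
        with gzip.open(file_name, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                last_byte = chunk[-1:]  # remember only the final byte seen
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or corrupted compressed data.
        return False
    return last_byte == PICKLE_STOP
```

Calling looks_intact on each path in the directory gives a list of suspect files. A True result is only a partial check: it shows that gzip finished writing and that the data ends with STOP, not that the pickle contents are valid.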

Upvotes: 2

pppery

Reputation: 3804

Although I do not think it is possible to check the validity of a gzip file other than by decompressing it, the pickle protocol defines a STOP opcode that must be present at the end of every pickled object. (If it is missing, unpickling raises an EOFError.) This STOP opcode is the . character, so you can partially check the validity of a pickle by checking that it ends with the . character. This also means that you can concatenate two valid pickles and then unpickle the result twice to get the two objects back. All pickles using protocol 2 or higher also begin with a \x80 character (the PROTO opcode).
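
A short sketch of these properties using in-memory pickles. The variable names are illustrative, and protocol 2 is chosen only so that the \x80 prefix applies.

```python
import io
import pickle

first = pickle.dumps({"a": 1}, protocol=2)
second = pickle.dumps([1, 2, 3], protocol=2)

# A complete pickle ends with STOP; protocol >= 2 starts with PROTO.
ends_with_stop = first.endswith(b".")
starts_with_proto = first.startswith(b"\x80")

# Two concatenated pickles are read back with two load() calls.
stream = io.BytesIO(first + second)
loaded = [pickle.load(stream), pickle.load(stream)]

# Dropping the final byte removes STOP, so unpickling fails.
try:
    pickle.loads(first[:-1])
    truncated_ok = True
except (EOFError, pickle.UnpicklingError):
    truncated_ok = False
```

The last-byte test is necessarily partial: a . byte can also occur inside pickled data, so a file truncated at exactly such a byte would still pass.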

Upvotes: 2
