PyNEwbie

Reputation: 4940

Why can I not open PDF files that have been copied with this code?

I need to do some manipulation of a number of PDF files. As a first step I wanted to copy them from a single directory into a tree that supports my needs. I used the following code:

for doc in docList:
    #          these steps just create the directory structure I need from the file name
    fileName = doc.split('\\')[-1]
    ID = fileName.split('_')[0]
    basedate = fileName.split('.')[0].split('_')[-1].strip()
    rdate = '\\R' + basedate + '-' +'C' + basedate
    newID = str(cikDict[ID])
    newpath = basePath + newID + rdate
    #            check existence of the new path
    if not os.path.isdir(newpath):
        os.makedirs(newpath)
    #          reads the file in and then writes it to the new directory   
    fstring = open(doc).read()
    outref = open(newpath +'\\' + fileName, 'wb')
    outref.write(fstring)
    outref.close()

When I run this code the directories are created and there are files with the correct name in each directory. However, when I click to open a file I get an error from Acrobat informing me that the file was damaged and could not be repaired.

I was able to copy the files using

shutil.copy(doc,newpath) 

in place of the last four lines, but I have not been able to figure out why I can't read the file as a string and then write it in a new location.

One thing I did was compare what was read from the source to what the file content was after a read after it had been written:

>>> newstring = open(newpath + '\\' +fileName).read()
>>> newstring == fstring
True

So it does not appear the content was changed?

Upvotes: 0

Views: 587

Answers (2)

mkl

Reputation: 95963

I have not been able to figure out why I can't read the file as a string and then write it in a new location.

Please be aware that PDF is a binary file format, not a text file format. Methods treating files (or data in general) as text may change it in different ways, especially:

  • Reading data as text interprets bytes and byte sequences as characters according to some character encoding. Writing text back as data transforms it again according to some character encoding.

    If the applied encodings differ, the result obviously differs from the original file. But even if the same encoding is used, differences can creep in: if the original file contains bytes which have no meaning in the applied encoding, some replacement character is used instead, and the resulting file contains the encoding of that replacement character, not the original byte sequence. Furthermore, some encodings allow multiple byte sequences for the same character. Thus, some input byte sequence may be replaced in the output by another sequence representing the same character.

  • End-of-line sequences may be changed according to the preferences of the platform.

    Binary files may contain different byte sequences used as end-of-line marker on one or the other platform, e.g. CR, LF, CRLF, ... Methods treating the data as text may replace all of them by the one sequence favored on the local platform. But as these bytes in binary files may have a different meaning than end-of-line, this replacement may be destructive.

  • Control characters in general may be ignored or reinterpreted.

    In many encodings the bytes 0..31 have meanings as control characters. Methods treating binary data as text may interpret them somehow which may result in a changed output again.

All these changes can utterly destroy binary data, e.g. compressed streams inside PDFs.
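
The end-of-line effect described above can be reproduced with a small sketch. It assumes Python 3, where text mode translates CR and CRLF to LF on reading by default; the sample bytes and the file name sample.bin are made up for illustration, and latin-1 is used only because it can decode every byte value:

```python
import os
import tempfile

# Made-up binary payload containing CRLF, a lone CR and a lone LF.
original = b"%PDF-1.4\r\nstream\rdata\nendstream"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(original)

    # Text mode: CR and CRLF are both translated to LF while reading.
    with open(path, "r", encoding="latin-1") as f:
        as_text = f.read()

    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(as_bytes == original)            # binary round trip keeps every byte
print(len(original), len(as_text))     # the text version has lost a byte
```

Writing as_text back out would therefore not reproduce the original file, which is enough to break the byte offsets a PDF relies on.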

Your code already writes with 'wb', but it reads the source file in text mode. Try opening the source in binary mode as well, by adding a b to the mode string ('rb'). Using binary mode for both reading and writing may solve your issue.

One thing I did was compare what was read from the source to what the file content was after a read after it had been written:

>>> newstring = open(newpath + '\\' +fileName).read()
>>> newstring == fstring
True

So it does not appear the content was changed?

Your comparison also reads the files as text. Thus, you do not compare the actual byte contents of the original and the copied file but their interpretations according to the encoding assumed while reading them. So damage has already been done on both sides of your comparison.
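
A byte-level comparison avoids that problem. The following is a minimal sketch, assuming Python 3; the helper same_bytes and the file names are hypothetical, and the "bad" copy is simulated by hand with one CRLF turned into LF:

```python
import filecmp
import os
import tempfile

def same_bytes(path_a, path_b):
    """Return True only if both files contain exactly the same bytes."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return a.read() == b.read()

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.pdf")
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")

    data = b"%PDF-1.4\r\n\x00\x1a\xffbinary"
    for path, content in ((src, data),
                          (good, data),
                          (bad, data.replace(b"\r\n", b"\n"))):
        with open(path, "wb") as f:
            f.write(content)

    good_matches = same_bytes(src, good)
    bad_matches = same_bytes(src, bad)
    # The standard library offers the same check; shallow=False forces
    # a comparison of the contents rather than only the os.stat() data.
    bad_matches_filecmp = filecmp.cmp(src, bad, shallow=False)

print(good_matches, bad_matches, bad_matches_filecmp)
```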

Upvotes: 2

dawg

Reputation: 104024

You should use shutil to copy files. It is platform aware and you avoid problems like this.

But you already discovered that.

You would be better served using with to open files. The files are then closed automatically, even if an error occurs, and it is more idiomatic:

with open(doc, 'rb') as fin, open(fn_out, 'wb') as fout:
    fout.write(fin.read())                     # the ENTIRE file is read with .read()

If you might be dealing with a large file, read and write in chunks:

with open(doc, 'rb') as fin, open(fn_out, 'wb') as fout:
    while True:
        chunk = fin.read(1024)     # read up to 1024 bytes at a time
        if not chunk:              # an empty result means end of file
            break
        fout.write(chunk)

Note the 'rb' and 'wb' arguments to open. Since you are clearly opening this file under Windows, binary mode prevents Python from translating end-of-line bytes or otherwise treating the file's contents as text.

You should also use os.path.join to build paths rather than string concatenation such as newpath + '\\' + fileName.
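
Putting both suggestions together might look like the sketch below. It assumes Python 3.3 or later (where shutil.copy returns the destination path); the function copy_into_tree, its parameters and the sample file name are made up for illustration and only mirror the directory layout from the question:

```python
import os
import shutil
import tempfile

def copy_into_tree(doc, base_path, new_id, basedate):
    """Copy doc into base_path/new_id/R<date>-C<date>/ and return the new path."""
    target_dir = os.path.join(base_path, new_id,
                              "R" + basedate + "-C" + basedate)
    os.makedirs(target_dir, exist_ok=True)   # no error if it already exists
    return shutil.copy(doc, target_dir)      # copies the bytes unchanged

# Small demonstration with a throwaway file standing in for a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    doc = os.path.join(tmp, "123_20100331.pdf")
    payload = b"%PDF-1.4\r\n\x00\x1a\xff"
    with open(doc, "wb") as f:
        f.write(payload)

    copied = copy_into_tree(doc, os.path.join(tmp, "tree"), "456", "20100331")
    with open(copied, "rb") as f:
        copied_bytes = f.read()

print(copied_bytes == payload)
```

Because os.path.join inserts the separator for the current platform, the same code works without hard-coded backslashes.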

Upvotes: 1
