Why can I not open pdf files that have been copied with this code

Question

I need to do some manipulation of a number of pdf files. As a first step I wanted to copy them from a single directory into a tree that supports my needs. I used the following code

for doc in docList:
    # these steps just create the directory structure I need from the file name
    fileName = doc.split('\\')[-1]
    ID = fileName.split('_')[0]
    basedate = fileName.split('.')[0].split('_')[-1].strip()
    rdate = '\\R' + basedate + '-' + 'C' + basedate
    newID = str(cikDict[ID])
    newpath = basePath + newID + rdate
    # check existence of the new path
    if not os.path.isdir(newpath):
        os.makedirs(newpath)
    # reads the file in and then writes it to the new directory
    fstring = open(doc).read()
    outref = open(newpath + '\\' + fileName, 'wb')
    outref.write(fstring)
    outref.close()

When I run this code the directories are created and there are files with the correct names in each directory. However, when I click to open a file I get an error from Acrobat informing me that the file is damaged and could not be repaired.

I was able to copy the files by replacing the last four lines with

shutil.copy(doc, newpath)

but I have not been able to figure out why I can't read the file as a string and then write it to a new location.

One thing I did was compare what was read from the source with what I read back from the new file after it had been written:

>>> newstring = open(newpath + '\\' + fileName).read()
>>> newstring == fstring
True

So it does not appear that the content was changed?

dawg · Accepted Answer

You should use shutil to copy files. It is platform aware and you avoid problems like this.

But you already discovered that.
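For reference, here is a minimal sketch of the shutil route. It assumes Python 3 (for `exist_ok` and the return value of `shutil.copy`); the helper name `copy_into`, the sample bytes, and the directory names are made up for illustration and are not from your code.

```python
import os
import shutil
import tempfile


def copy_into(src, dest_dir):
    """Copy src into dest_dir, creating the directory first if needed."""
    os.makedirs(dest_dir, exist_ok=True)  # no separate isdir() check required
    return shutil.copy(src, dest_dir)     # returns the path of the new file


# Hypothetical demo data: a few bytes that are fragile in text mode.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'sample.pdf')
sample_bytes = b'%PDF-1.4\r\n\x00\x1a\xff\r\n%%EOF\r\n'
with open(source, 'wb') as f:
    f.write(sample_bytes)

copied = copy_into(source, os.path.join(workdir, '1234', 'R20100331-C20100331'))
with open(copied, 'rb') as f:
    copied_bytes = f.read()

print(copied_bytes == sample_bytes)
```

shutil copies the raw bytes, so the question of text versus binary mode never comes up.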

You would be better served using a with statement to open your files. The files are then closed automatically when the block exits, even if an error occurs. It is also more idiomatic:

with open(doc, 'rb') as fin, open(fn_out, 'wb') as fout:
    fout.write(fin.read())                     # the ENTIRE file is read with .read()

If you might be dealing with a large file, read and write in chunks:

with open(doc, 'rb') as fin, open(fn_out, 'wb') as fout:
    while True:
        chunk = fin.read(1024)
        if chunk:
            fout.write(chunk)
        else:
            break

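Note the 'rb' and 'wb' arguments to open. Since you are clearly working under Windows, binary mode matters: in text mode, Python 2 on Windows translates '\r\n' line endings to '\n' and stops reading at the first Ctrl-Z (0x1A) byte, so open(doc).read() can return an altered and truncated version of a binary file such as a PDF. That is most likely also why your comparison returned True: both strings were read in text mode, so both went through the same translation.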
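To see the effect of the mode, here is a small sketch that copies the same bytes both ways. It is written for Python 3, where text mode behaves somewhat differently from Python 2 on Windows, so it only illustrates the newline translation part; the latin-1 encoding is an assumption chosen so that every byte decodes, and the function names and payload are invented for the demo.

```python
import os
import tempfile


def copy_binary(src, dst):
    """Copy src to dst byte for byte using binary mode."""
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        fout.write(fin.read())


def copy_via_text(src, dst):
    """Copy src to dst, but read it in text mode (the lossy approach)."""
    # latin-1 maps every byte to a character, so decoding never fails,
    # but text mode still turns each '\r\n' into '\n' while reading.
    with open(src, 'r', encoding='latin-1') as fin:
        text = fin.read()
    with open(dst, 'wb') as fout:
        fout.write(text.encode('latin-1'))


def read_bytes(path):
    with open(path, 'rb') as f:
        return f.read()


# Hypothetical payload containing four '\r\n' pairs.
payload = b'%PDF-1.4\r\nstream\r\n\x00\x1a\xff\r\nendstream\r\n'
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'src.pdf')
with open(src, 'wb') as f:
    f.write(payload)

bin_dst = os.path.join(workdir, 'binary_copy.pdf')
text_dst = os.path.join(workdir, 'text_copy.pdf')
copy_binary(src, bin_dst)
copy_via_text(src, text_dst)

print(read_bytes(bin_dst) == payload)   # binary copy is identical
print(read_bytes(text_dst) == payload)  # text-mode copy lost the '\r' bytes
```

A PDF stores byte offsets to its internal objects, so even a few dropped bytes are enough for a reader to reject the file.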

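You should also use os.path.join rather than a newpath + '\\' + fileName type of operation. It inserts the correct separator for the platform and spares you from escaping backslashes by hand.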
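As a rough sketch, the destination path from the question could be built like this. The helper `build_target` and the sample values are hypothetical; only the 'R<date>-C<date>' folder pattern is taken from your code.

```python
import os


def build_target(base_path, new_id, basedate, file_name):
    """Build <base_path>/<new_id>/R<basedate>-C<basedate>/<file_name>."""
    folder = 'R' + basedate + '-C' + basedate
    # os.path.join supplies the platform's separator between the parts.
    return os.path.join(base_path, new_id, folder, file_name)


# Hypothetical values standing in for basePath, newID, basedate and fileName.
target = build_target('archive', '1234', '20100331', '0000123_20100331.pdf')
print(target)
```

os.path.dirname(target) then gives the directory to pass to os.makedirs before writing the file.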
