How to edit a pdf file, replacing its data?

Question

I am trying to rotate pages in a pdf file, and then replace the old pages with the rotated ones in the SAME pdf file.

I have written the following code:

#!/usr/bin/python

import os
from pyPdf import PdfFileReader, PdfFileWriter

my_path = "/home/USER/Desktop/files/"

input_file_name = os.path.join(my_path, "myfile.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
input_file.decrypt("MyPassword")
output_PDF = PdfFileWriter()

for num_page in range(0, input_file.getNumPages()):
    page = input_file.getPage(num_page)
    page.rotateClockwise(270)
    output_PDF.addPage(page)

#Trying to replace old data with new data in the original file, not
#create a new file and add the new data!
output_file_name = os.path.join(my_path, "myfile.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()

The above code raises an error (traceback below). I've even tried opening the input file with:

input_file = PdfFileReader(file(input_file_name, "r+b"))

but it didn't work either...

Changing the line:

output_file_name = os.path.join(my_path, "myfile.pdf")

to:

output_file_name = os.path.join(my_path, "myfile2.pdf")

fixes everything, but it's not what I want...

Any help?

ERROR CODE:

Traceback (most recent call last):
  File "12-5.py", line 22, in <module>
    output_PDF.write(output_file)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 264, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 324, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 339, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 315, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 324, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 345, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 649, in getObject
    retval = readObject(self.stream, self)
  File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 67, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 564, in readFromStream
    raise utils.PdfReadError, "Unable to find 'endstream' marker after stream."
pyPdf.utils.PdfReadError: Unable to find 'endstream' marker after stream.

David Wolever · Accepted Answer

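The issue, I suspect, is that pyPdf is still reading from the input file while you are writing to it. Opening the same path with "wb" truncates the file, and your traceback shows pyPdf fetching objects from the input stream (getObject) in the middle of write().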
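
The truncation effect can be reproduced without any PDF library. Here is a minimal sketch using only the standard library (the file name demo.bin is made up for illustration): a reader that has not yet pulled in its data finds nothing left once the same path is opened with "wb".

```python
import os
import tempfile

# Create a throwaway file with some content (hypothetical name, for illustration).
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.bin")
with open(path, "wb") as f:
    f.write(b"original contents")

# Open for reading but do not read yet, like a parser that loads data on demand.
reader = open(path, "rb")

# Opening the same path with "wb" truncates the file to zero bytes right away.
writer = open(path, "wb")

# The reader now sees an empty file.
data_seen_by_reader = reader.read()
size_after_truncation = os.path.getsize(path)

writer.close()
reader.close()
```

Here data_seen_by_reader ends up empty, which matches the kind of failure in the traceback: the parser goes looking for data that is no longer in the file.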

The correct fix, as you've noticed, is to write to a separate file and then replace the original file with the new one. Something like this:

output_file_name = os.path.join(my_path, "myfile-temporary.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
os.rename(output_file_name, input_file_name)

I've written a bit of code which simplifies this: https://github.com/shazow/unstdlib.py/blob/master/unstdlib/standard/contextlib_.py#L14

from unstdlib.standard.contextlib_ import open_atomic

with open_atomic(input_file_name, "wb") as output_file:
    output_PDF.write(output_file)

This will automatically create a temporary file, write to it, then replace the original file.
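
If you would rather not add a dependency, the same idea can be sketched with the standard library alone. This is a rough Python 3 sketch, not the unstdlib implementation; the helper name replace_file_atomically and its write_func parameter are made up for illustration. It writes to a temporary file in the same directory as the target, then swaps it in with os.replace.

```python
import os
import tempfile


def replace_file_atomically(path, write_func):
    """Write new contents for `path` via a temporary file, then swap it in.

    `write_func` is a hypothetical callback that receives a binary file
    object and writes the new contents to it.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            write_func(tmp_file)
        # os.replace overwrites an existing destination.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the code from the question, the call would look like replace_file_atomically(input_file_name, output_PDF.write). Note that mkstemp creates the temporary file with owner-only permissions, so the replaced file may end up with different permissions than the original.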

Edit: I initially misread the question. My original answer is below; it doesn't address this problem, but it may help other people.

Your code is fine, and should work without issue on "most" PDFs.

The issue you're seeing is an incompatibility between PyPDF and the specific PDF you're trying to use. This may be a bug in PyPDF or it may be that the PDF isn't totally valid.

Two things you can try:

1. See if PyPDF2 can read the file. Install PyPDF2 with pip install PyPDF2, replace import pyPdf … with import PyPDF2 …, then re-run your script.
2. Use another program to re-encode your PDF and see if that works. For example, something like convert bad.pdf bad.ps; convert bad.ps maybe-good.pdf might fix things.
