Reputation: 13
I am trying to split a single 20-page PDF file into five separate PDF files: the first contains pages 1-3, the second only page 4, the third pages 5-10, the fourth pages 11-17, and the fifth pages 18-20. I need working code in Python. The code below splits the entire PDF into single pages, but I want the pages grouped.
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open("input.pdf", "rb"))
for i in range(inputpdf.numPages):
    j = i+1
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("page%s.pdf" % j, "wb") as outputStream:
        output.write(outputStream)
Upvotes: 1
Views: 10947
Reputation: 2503
Here is how to extract specific pages from a PDF file and save them as a separate PDF using Python.
pip install PyPDF2 # to install module/package
from PyPDF2 import PdfReader, PdfWriter
pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfReader(pdf_file_path)
pages = [0, 2, 4] # page 1, 3, 5
pdfWriter = PdfWriter()
for page_num in pages:
    pdfWriter.add_page(pdf.pages[page_num])
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)  # the with block closes the file, so no f.close() is needed
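To get the grouped files the question asks for, the same idea can be repeated once per page range. The following is only a minimal sketch: the helper names pages_for_groups and split_pdf and the output file naming are my own assumptions, not part of PyPDF2, and it assumes a PyPDF2 version that provides PdfReader and PdfWriter as used above.

```python
def pages_for_groups(groups, total_pages):
    """Turn 1-based inclusive (first, last) page ranges into lists of 0-based indices."""
    result = []
    for first, last in groups:
        if not 1 <= first <= last <= total_pages:
            raise ValueError(
                f"invalid page range {first}-{last} for a {total_pages}-page file"
            )
        result.append(list(range(first - 1, last)))
    return result


def split_pdf(pdf_file_path, groups):
    """Write one PDF per page group (hypothetical helper, not a PyPDF2 function)."""
    # Imported here so pages_for_groups can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    pdf = PdfReader(pdf_file_path)
    file_base_name = pdf_file_path.replace('.pdf', '')
    index_lists = pages_for_groups(groups, len(pdf.pages))
    for (first, last), indices in zip(groups, index_lists):
        writer = PdfWriter()
        for index in indices:
            writer.add_page(pdf.pages[index])
        # Assumed naming scheme: <base>_<first>-<last>.pdf
        with open(f'{file_base_name}_{first}-{last}.pdf', 'wb') as f:
            writer.write(f)


# Page groups from the question: 1-3, 4, 5-10, 11-17, 18-20.
QUESTION_GROUPS = [(1, 3), (4, 4), (5, 10), (11, 17), (18, 20)]
```

Calling split_pdf('Unknown.pdf', QUESTION_GROUPS) should then write five files, one per group. The range check is there so that a typo in a range fails early instead of producing a short file.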
CREDIT : How to extract PDF pages and save as a separate PDF file using Python
Upvotes: 1
Reputation: 36360
This looks to me like a task for pdfrw. Based on this example from GitHub, I wrote the following example code:
from pdfrw import PdfReader, PdfWriter
pages = PdfReader('inputfile.pdf').pages
parts = [(3,6),(7,10)]
for part in parts:
    outdata = PdfWriter(f'pages_{part[0]}_{part[1]}.pdf')
    for pagenum in range(*part):
        outdata.addpage(pages[pagenum-1])
    outdata.write()
This creates two files, pages_3_6.pdf and pages_7_10.pdf, each with 3 pages, i.e. pages 3, 4, 5 and 7, 8, 9, because range() excludes its end value. Note the pagenum-1 in the code: the -1 is needed because PDF page numbering starts at 1, while the pages list is indexed from 0. I also used so-called f-strings to build the output file names. In my opinion it is a slick method, but it is not available in Python 2 or in Python 3 versions before 3.6 (I tested my code in 3.6.7), so you might use the old formatting method instead if you wish.
Remember to alter the filenames and ranges according to your needs.
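Because range() stops one short of its end value, the ranges from the question need an end value one higher than the last page wanted. Here is a rough sketch of that conversion; to_part, output_name and write_parts are names I made up for illustration, and the pdfrw calls are the same ones used in the code above.

```python
def to_part(first, last):
    """Convert a 1-based inclusive page range into a (start, stop) pair for range()."""
    return (first, last + 1)


def output_name(part):
    """Name the file after the first and last page it actually contains."""
    return f'pages_{part[0]}_{part[1] - 1}.pdf'


def write_parts(input_path, parts):
    """Write one file per part (hypothetical helper, not part of pdfrw)."""
    # Imported here so the helpers above work without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    pages = PdfReader(input_path).pages
    for part in parts:
        outdata = PdfWriter(output_name(part))
        for pagenum in range(*part):
            outdata.addpage(pages[pagenum - 1])  # pages is a 0-based list
        outdata.write()


# Page groups from the question, written as inclusive (first, last) pairs.
inclusive_groups = [(1, 3), (4, 4), (5, 10), (11, 17), (18, 20)]
parts = [to_part(first, last) for first, last in inclusive_groups]
print(parts)  # [(1, 4), (4, 5), (5, 11), (11, 18), (18, 21)]
```

Calling write_parts('inputfile.pdf', parts) should then produce pages_1_3.pdf, pages_4_4.pdf, pages_5_10.pdf, pages_11_17.pdf and pages_18_20.pdf. Unlike the code above, this sketch names each file after the last page it really contains rather than after the exclusive end of the range.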
Upvotes: 3
Reputation: 1
If you have Python 3, you can use tika, as described in the following answer:
How to extract text from a PDF file?
Upvotes: -1