Alper D.

Reputation: 25

Removing pages from a PDF file based on a condition using Python

I have a PDF file with around 1000 pages and want to remove the pages that do not contain a specific word. For instance, the code would search each page for a word such as "STACKOVER"; if the word is not found on a page, that page is removed and the search continues with the next page. At the end, the file is saved.

Upvotes: 2

Views: 1311

Answers (1)

The way to do this is: First, define the search words you are looking for (I tested it on a medical journal with searchwords=['unclear risk for poorly']). Second, find all pages containing the word or string and store their page numbers in a list (pages_to_delete). For safekeeping, I also write them to a CSV file that lists the page on which each search word was found. Third, open the original PDF, copy every page that is not in the list, and save the result as a new PDF. Note that this removes the pages that contain the search word; to remove the pages that do not contain it, as the question asks, change "if i not in pages_to_delete" to "if i in pages_to_delete" in the final loop.

import PyPDF2
import re
from PyPDF2 import PdfFileWriter, PdfFileReader

# PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2 API (removed in PyPDF2 3.0)
pdfFileObj=open(r'C:\Users\s-degossondevarennes\......\dddtest.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

# Extract the text of every page once, so the list index equals the page index
pages_text=[pdfReader.getPage(page).extractText() for page in range(number_of_pages)]
words_start_pos={}
words={}

searchwords=['unclear risk for poorly']

pages_to_delete = []

with open('Pages.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            # word is treated as a regular expression; the page text is lowercased, so search words must be lowercase
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                # CSV page numbers are 1-based; pages_to_delete holds 0-based indices
                f.write('{0},{1}\n'.format(page+1, words[page][i]))
                if page not in pages_to_delete:
                    pages_to_delete.append(page)

# All text has been extracted, so the input file can be closed
pdfFileObj.close()


# PdfFileReader accepts a path or a binary file object; its second argument is "strict", not a file mode
infile = PdfFileReader(r'C:\Users\s-degossondevarennes\.......\dddtest.pdf')
output = PdfFileWriter()

for i in range(infile.getNumPages()):
    if i not in pages_to_delete:
        p = infile.getPage(i)
        output.addPage(p)

with open('Newdddtest.pdf', 'wb') as f:
    output.write(f)
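
For newer installations, here is a minimal sketch of the selection the question asks for (keep only the pages that contain the word), assuming the pypdf package, which is the successor to PyPDF2. The function names indices_of_pages_with_word and keep_pages_with_word and the sample strings are hypothetical, chosen only for illustration. The page selection is kept separate from the PDF I/O so it can be tried on plain strings first.

```python
def indices_of_pages_with_word(page_texts, word):
    """Return the 0-based indices of pages whose text contains `word` (case-insensitive)."""
    needle = word.lower()
    return [index for index, text in enumerate(page_texts) if needle in text.lower()]


def keep_pages_with_word(src_path, dst_path, word):
    """Write a copy of src_path that contains only the pages mentioning `word`."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    # extract_text() may yield nothing for scanned or image-only pages.
    texts = [page.extract_text() or "" for page in reader.pages]
    writer = PdfWriter()
    for index in indices_of_pages_with_word(texts, word):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as out:
        writer.write(out)


# Plain strings standing in for extracted page text:
sample_pages = ["intro STACKOVER here", "nothing relevant", "stackover in lowercase"]
print(indices_of_pages_with_word(sample_pages, "STACKOVER"))  # [0, 2]
```

Calling keep_pages_with_word with an input path, an output path and the search word would then write the filtered file. Pages without extractable text never match, so they are dropped.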

Update

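Whether the text is bold should not matter, because extractText() returns plain text. Letter case does matter: the code above lowercases the page text before searching, so only lowercase search words can match. If your search word contains capital letters (for example "STACKOVER") and you want to match it exactly as written, replace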

words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]

with

words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page])]
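
As an alternative to choosing between the two lines, here is a small sketch that makes case handling an explicit option, using re.IGNORECASE so that search words of any case can match, and re.escape so that the search word is matched literally instead of as a regular expression. The helper name find_hits and the sample text are hypothetical.

```python
import re


def find_hits(page_text, word, ignore_case=True):
    """Return each occurrence of `word` exactly as it is written in page_text."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes characters such as "." or "(" in the search word match literally.
    pattern = re.compile(re.escape(word), flags)
    return [match.group(0) for match in pattern.finditer(page_text)]


text = "Stackover and STACKOVER appear here."
print(find_hits(text, "STACKOVER"))                     # ['Stackover', 'STACKOVER']
print(find_hits(text, "STACKOVER", ignore_case=False))  # ['STACKOVER']
```

A page would then count as a match whenever find_hits returns a non-empty list for its text.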

Upvotes: 1
