Michael

Reputation: 29

Search for multiple words in a PDF

I'm trying to write a Python script that finds specific words in PDF files. Right now I have to scroll through the output to find the lines where they occur.

I want only the lines containing the words to be printed or saved to a separate file.

# import packages
import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("Filename.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
Strings = "House|Property|street"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(Strings, Text)
    print(ResSearch)

When I run the code above, I have to scroll through the output to find the lines where the words occur. I would like the lines containing the words to be printed or saved to a separate file, or the page containing the line to be saved on its own as a PDF or TXT file. Thanks in advance for the help.

Upvotes: 1

Views: 2917

Answers (1)

Pieter

Reputation: 3447

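You could split the text of each page into lines and test each line with re.search. (re.match only matches at the start of a string, so it would miss a word that appears in the middle of a line.)

As an example:

for i in range(object.getNumPages()):
    page = object.getPage(i)
    text = page.extractText()
    # print only the lines that contain one of the key terms
    for line in text.splitlines():
        if re.search('House|Property|street', line):
            print(line)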
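
If the matching lines should be kept with their page numbers or saved to a file, as the question asks, something along these lines might work. This is only a sketch under a few assumptions: the names find_matching_lines, save_hits and sample_pages are made up for illustration, the \b word boundaries are an addition to the original pattern so that "Household" does not count as "House", and pages are numbered from 1 rather than 0. The sample strings stand in for the text that extractText() would return for each page.

```python
import re

# Same key terms as in the question; \b limits matches to whole words
# (an assumption, not part of the original pattern).
PATTERN = re.compile(r"\b(?:House|Property|street)\b")


def find_matching_lines(page_texts, pattern=PATTERN):
    """Return (page_number, line) pairs for every line containing a key term."""
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if pattern.search(line):
                hits.append((page_number, line.strip()))
    return hits


def save_hits(hits, path):
    """Write one 'page N: line' entry per hit to a plain-text file."""
    with open(path, "w", encoding="utf-8") as out:
        for page_number, line in hits:
            out.write(f"page {page_number}: {line}\n")


# Stand-in page texts. With the reader from the question these would be
# [object.getPage(i).extractText() for i in range(object.getNumPages())]
sample_pages = [
    "The House is old.\nNothing to see here.",
    "Turn left at the street corner.\nHousehold items only.",
]

hits = find_matching_lines(sample_pages)
for page_number, line in hits:
    print(f"page {page_number}: {line}")
```

Calling save_hits(hits, "matches.txt") would then write the same entries to a text file; the file name is only a placeholder. For a case-insensitive search, re.IGNORECASE can be passed as a second argument to re.compile.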

Upvotes: 1
