wilian

Reputation: 25

How to extract all text from a PDF?

I'm using the PyPDF2 library to extract text from a PDF, but I'm having trouble with the loop.

I'm using the following code, and I can extract the text of the first page.

from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Print the text of the first page ([0] is the first page)
page = reader.pages[0]
print(page.extractText())

I would like to use the page count that I get from .getNumPages() and loop over reader.pages that many times, instead of reading only reader.pages[0].

Here is the code I'm using to try to print all 99 pages:

from PyPDF2 import PdfFileReader
reader = PdfFileReader("mypdf.pdf")
# Print number of pages
num_page = reader.getNumPages()
print(num_page)
# Print the text of each page, where [0] is the first page

page = reader.pages[0]
i = 0
print(type(num_page))
print(type(i))
for i in page:
    if i < num_page:
        page = reader.pages[i]
        print(page.extractText())
        i = i + 1
    else:
        print("done")

Error occurred:

Traceback (most recent call last):
  File "/home/wilian/PycharmProjects/ExtractText/pypdf.py", line 13, in <module>
    if i < num_page:
TypeError: '<' not supported between instances of 'NameObject' and 'int'
99
<class 'int'>
<class 'int'>

Process finished with exit code 1

Upvotes: 1

Views: 178

Answers (1)

0m3r

Reputation: 12495

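Try a simple for loop over range(). In your code, "for i in page" iterates over the keys of the page's dictionary (that is where the NameObject in the error comes from), not over page numbers.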

Example

from PyPDF2 import PdfFileReader


def pdf_info():
    with open("my_pdf.pdf", "rb") as f:
        reader = PdfFileReader(f)
        for i in range(reader.getNumPages()):
            print(i)
            # page = reader.pages[i]
            # print(page.extractText())


if __name__ == '__main__':
    pdf_info()
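
To make both the cause of the error and the fix concrete, here is a minimal sketch that runs without a PDF file. FakePage and FakeReader are hypothetical stand-ins I made up for illustration; they only mimic the pieces used in the question (a dict-like page, reader.pages, getNumPages() and extractText()). extract_all_text is also just an illustrative helper name, not part of PyPDF2.

```python
class FakePage(dict):
    """Hypothetical stand-in for a PyPDF2 page, which is a dict-like object."""

    def __init__(self, text):
        super().__init__({"/Type": "/Page"})
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Hypothetical stand-in for PdfFileReader with only what the loop needs."""

    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]

    def getNumPages(self):
        return len(self.pages)


def extract_all_text(reader):
    """Collect the text of every page by indexing reader.pages with range()."""
    texts = []
    for i in range(reader.getNumPages()):
        texts.append(reader.pages[i].extractText())
    return "\n".join(texts)


reader = FakeReader(["first page", "second page"])
# Iterating a single page walks its dictionary keys, not page numbers,
# so comparing such a key with an int raises TypeError.
print(list(reader.pages[0]))
# Looping over range(getNumPages()) visits every page exactly once.
print(extract_all_text(reader))
```

With the real library, you should be able to pass the PdfFileReader from the example above to the same function, assuming your PyPDF2 version exposes reader.pages and extractText() as in your question.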

Upvotes: 1
