Rishi N

Reputation: 67

What is the correct way to extract text using PyPDF2?

I am trying to extract text from a PDF file. I am using the following code for this task:

def get_pdf_text(file):
    pdffile = PyPDF2.PdfFileReader(file)
    numpages = pdffile.getNumPages()
    for pages in range(0,numpages):
        currpage = pdffile.getPage(pages)
        content = currpage.extractText().encode('UTF-8')
    return content

However, the output I am getting is very different from the source file:

b'Inheritance is a basic concept of Object\n-\nOriented Programming where\n \nthe basic idea is to create new classes that add extra detail to\n \nexisting classes.
 This is done by allowing the new classes to reuse\n \nthe methods and variables of the existing classes and new methods and\n \nclasses are added to specialise the new class.
 Inheritance models the\n \n\n-\nkind\n-\n\nbjects), for example,\n \npostgraduates and undergraduates are both kinds of student. This kind\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \nclass exten\n\n \n \nInheritance can occur on several layers, where if visualised would\n \ndisplay a larger tree structure. For example, we could further extend\n \n\n \n\n\n \n\n \n\n \n\n \n'

Not only are there multiple \n characters occurring in unexpected locations, but some of the content also seems to be missing. I can't seem to find a fix. Thank you in advance for your help.

Upvotes: 1

Views: 345

Answers (1)

I'mahdi

Reputation: 24069

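  1. The problem is in your PDF file. I copied your text into a new PDF file, and extraction works correctly on that one.

  2. Wrap the extracted text in str() before returning it, instead of calling .encode('UTF-8'), which turns it into a bytes object.

  3. Print the returned text with print(), so that newlines are rendered as line breaks.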
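
Points 2 and 3 come down to the difference between bytes and str in Python 3. The following is a minimal sketch using only the standard library; the sample string is a made-up stand-in for extracted text, not real PyPDF2 output.

```python
# Stand-in for text returned by a PDF extractor (hypothetical sample).
text = "Object\n-\nOriented Programming"

as_bytes = text.encode("UTF-8")  # bytes, as produced by the question's code
as_str = str(text)               # str, as produced by the modified code

# Printing bytes shows its repr: a b'...' literal with escaped \n sequences.
print(as_bytes)

# Printing str renders every newline as an actual line break.
print(as_str)
```

This accounts for the b'...' prefix and the visible \n sequences in the question's output, but not for the missing words, which point 1 attributes to the PDF file itself.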

Here is the modified code:

import PyPDF2

def get_pdf_text(file):
    pdffile = PyPDF2.PdfFileReader(file)
    numpages = pdffile.getNumPages()
    content = ''
    for pages in range(0,numpages):
        currpage = pdffile.getPage(pages)
        # Append each page's text so earlier pages are not overwritten
        content += str(currpage.extractText())
    return content

print(get_pdf_text('Untitled.pdf'))

Output:

'Inheritance is a basic concept of Object-Oriented Programming where the 
basic idea is to create new classes that add extra detail to existing classes. 
This is done by allowing the new classes to reuse the methods and 
variables of the existing classes and new methods and classes are added to 
specialise the new class. Inheritance models the Òis-kind-ofÓ relationship 
between entities (or objects), for example, postgraduates and 
undergraduates are both kinds of student. This kind of relationship can be 
visualised as a tree structure, where ÔstudentÕ would be the more general 
root node and both ÔpostgraduateÕ and ÔundergraduateÕ would be more 
specialised extensions of the ÔstudentÕ node (or the child nodes). In this 
relationship ÔstudentÕ would be 
known as the superclass or parent class whereas, ÔpostgraduateÕ would be 
known as the subclass or child class because the ÔpostgraduateÕ class 
extends the ÔstudentÕ class. 
Inheritance can occur on several layers, where if visualised would display 
a larger tree structure. For example, we could further extend the 
ÔpostgraduateÕ node by adding two extra extended classes to it called, 
ÔMSc StudentÕ and ÔPhD StudentÕ as both these types of student are kinds 
of postgraduate student. This would mean that both the ÔMSc StudentÕ and 
ÔPhD StudentÕ classes would inherit methods and variables from both the 
ÔpostgraduateÕ and Ôstudent classesÕ. '
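
Newer PyPDF2 releases (and the successor package, pypdf) replace PdfFileReader, getNumPages, getPage and extractText with PdfReader, the pages attribute and extract_text(). The sketch below assumes such a release is installed. The helper normalize_whitespace is a hypothetical addition, not part of the library, and it only tidies stray newlines; it cannot recover text that the extractor failed to read from the PDF.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including stray newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def get_pdf_text(path):
    """Return the text of every page in the PDF at `path`, one line per page."""
    # Imported lazily so normalize_whitespace works without the library.
    # Assumption: a PyPDF2 release that provides PdfReader is installed
    # (with pypdf, the import would be `from pypdf import PdfReader`).
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return nothing for pages without extractable text.
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(normalize_whitespace(page) for page in pages)


# Usage, with the file name from the answer:
#     print(get_pdf_text("Untitled.pdf"))
```

Collapsing whitespace this way also flattens intentional paragraph breaks within a page, so it suits running prose better than tables or code listings.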

Upvotes: 3
