Dulmini Jayasinghe

Reputation: 41

How to extract text data from a multi-page CV in PDF format using PyPDF2?

I extracted the text content from a multi-page CV in PDF format using PyPDF2 and am trying to write it to a text file, but I get the following error message when writing the content.

Here is my code:

import PyPDF2

newFile = open('details.txt', 'w')
file = open("cv3.pdf", 'rb')

pdfreader = PyPDF2.PdfFileReader(file)
numPages = pdfreader.getNumPages()
print(numPages)

page_content = ""
for page_number in range(numPages):
    page = pdfreader.getPage(page_number)
    page_content += page.extractText()

newFile.write(page_content)
print(page_content)

file.close()
newFile.close()

The error message:

Traceback (most recent call last):
  File "C:/Users/HP/PycharmProjects/CVParser/pdf.py", line 16, in <module>
    newFile.write(page_content)
  File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 827: character maps to <undefined>

Process finished with exit code 1

The same code worked with another multi-page PDF file (a docx file that was converted to PDF).

Please help me if anyone knows the solution.

Upvotes: 1

Views: 916

Answers (1)

Rahul Agarwal

Reputation: 4100

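This should solve your problem in Python 3. Note the explicit encoding; without it, open() uses the platform default (cp1252 in your traceback), which cannot represent '\u0141':

with open("Output.txt", "w", encoding="utf-8") as text_file:
    print("{}".format(page_content), file=text_file)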

If the above does not work for you, then try writing the encoded bytes in binary mode:

with open("Output1.txt", "wb") as text_file:
    text_file.write(page_content.encode("UTF-8"))
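To see why the encoding matters, here is a minimal sketch that reproduces the failure without a PDF. It assumes page_content holds the extracted text; the sample string and the output file name are made up for illustration. It first tries to encode the text as cp1252, which is what the traceback shows the write was using, and then writes the same text with an explicit UTF-8 encoding and reads it back.

```python
import os
import tempfile

# Hypothetical stand-in for the text extracted from the PDF; '\u0141' is the
# character named in the traceback.
page_content = "Name: \u0141ukasz"

# cp1252 has no mapping for '\u0141', so encoding with it fails.
try:
    page_content.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding="utf-8" makes the text-mode write independent of the
# platform default encoding.
out_path = os.path.join(tempfile.gettempdir(), "details_utf8_demo.txt")
with open(out_path, "w", encoding="utf-8") as text_file:
    text_file.write(page_content)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as text_file:
    round_trip = text_file.read()

print(cp1252_ok, round_trip == page_content)
```

Whichever way you write the file, open it with UTF-8 as well when reading it later, otherwise the non-ASCII characters may show up garbled.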

Upvotes: 1
