parth shukla

Reputation: 117

How to convert bytes from PDF to string in Python?

I am trying to convert the bytes of a downloaded PDF into a string. I fetch the file with book_download_page = requests.get(link) and then read the bytes with content = book_download_page.content.

What I have tried:

content = book_download_page.content.decode('utf-8')

The error I get:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Edit: You can try this link for downloading.

Thank you!

Upvotes: 0

Views: 4331

Answers (1)

user176692

Reputation: 830

A PDF is a binary format whose contents are made up of tokens, not UTF-8 text, so calling decode('utf-8') on the raw bytes will not work. See:

Adobe PDF Reference

You can parse PDFs and extract their text with tools such as PoDoFo in C++ or PDFBox in Java. There is also a PDF text stripper for Python (python-pdfbox):

import pdfbox

pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')   # Result .txt will be in directory/originalPDF.txt

This is a simple example paraphrased from python-pdfbox, which is also worth a look if you want to extract other things, such as images.
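The example above reads the PDF from a file, while the question starts from bytes returned by requests. A minimal sketch of bridging the two is to write the bytes to disk unchanged, in binary form, and then hand the path to the extractor. The helper name save_pdf_bytes and the file names are hypothetical, and the header check assumes the file begins with the usual %PDF- marker. The position of 0xe2 in the reported error is consistent with the binary comment line that many PDFs carry right after that header, though that is only a guess about this particular file.

```python
from pathlib import Path


def save_pdf_bytes(content: bytes, destination: str) -> Path:
    """Write raw PDF bytes to disk unchanged (hypothetical helper)."""
    # A PDF normally starts with the marker b"%PDF-". Anything else is
    # probably not the document, e.g. an HTML error page from the server.
    if not content.startswith(b"%PDF-"):
        raise ValueError("content does not look like a PDF")
    path = Path(destination)
    path.write_bytes(content)  # binary write, no text decoding involved
    return path


# Illustrative bytes only: a PDF header followed by a binary comment line.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed as expected:", exc.reason)
```

With the question's variables this would be called as save_pdf_bytes(book_download_page.content, 'directory/originalPDF.pdf'), after which pdf_ref.extract_text('directory/originalPDF.pdf') can read the saved file.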

Upvotes: 1
