Reputation: 117
I am trying to convert the bytes I get from book_download_page = requests.get(link)
and then content = book_download_page.content
into a string.
What I have tried:
content = book_download_page.content.decode('utf-8')
The error I get:
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Edit: You can try this link for downloading
Thank you!
Upvotes: 0
Views: 4331
Reputation: 830
PDF contents are made up of tokens rather than plain text, which is why decoding the downloaded bytes as UTF-8 fails; see here:
You can parse PDFs and extract their text with tools like PoDoFo in C++ or PDFBox in Java, and there is also a PDF text stripper in Python:
import pdfbox
pdf_ref = pdfbox.PDFBox()
pdf_ref.extract_text('directory/originalPDF.pdf')  # the extracted text is written to directory/originalPDF.txt
This is a simple example paraphrased from python-pdfbox, which is also useful if you want to convert other things, like images.
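To connect this with the code in the question: the bytes returned by requests are the PDF file itself, so they probably need to be written to disk in binary mode rather than decoded. Below is a minimal sketch of that step, assuming the download really is a PDF. The helper names looks_like_pdf and save_pdf_bytes are hypothetical, and sample is only a stand-in for book_download_page.content. Many PDFs begin with a header line followed by a comment made of high-bit bytes, which would explain a decode failure at position 10.

```python
import os
import tempfile


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    return b"%PDF-" in data[:1024]


def save_pdf_bytes(data: bytes, path: str) -> str:
    """Write the raw bytes to path in binary mode, without decoding them."""
    if not looks_like_pdf(data):
        raise ValueError("data does not look like a PDF")
    with open(path, "wb") as fh:
        fh.write(data)
    return path


# Stand-in for book_download_page.content (assumption: a typical PDF start).
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

try:
    sample.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # The binary comment after the header is not valid UTF-8.
    decode_error = exc

# Save the bytes unchanged; the file name here is only an example.
pdf_path = save_pdf_bytes(sample, os.path.join(tempfile.mkdtemp(), "originalPDF.pdf"))
```

The saved path can then be passed to a text extractor such as the extract_text call above, instead of calling decode on the response content.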
Upvotes: 1