Reputation: 488
Consider the following snippet:
import io
from pdfminer.high_level import extract_text_to_fp  # assumed source: pdfminer.six

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')
data = result.getvalue()
This results in the following error:
ValueError: Codec is required for a binary I/O output
If I leave out `output_type`, I get this error instead:
`UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>`
I don't understand why this happens, and would like help with a workaround.
Upvotes: 2
Views: 1503
Reputation: 488
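I figured out how to fix the problem.
First, open "file.pdf" in binary mode. A PDF is binary data, and in text mode Python tries to decode it with the platform's default encoding, which is what raises the UnicodeDecodeError.
Then, if you want to read the output into memory, use BytesIO instead of StringIO and decode the result.
For example: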
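import io
from pdfminer.high_level import extract_text_to_fp  # assumed source: pdfminer.six

result = io.BytesIO()
# 'rb' hands the raw bytes to the parser instead of decoding them as text
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')
# the buffer holds bytes, so decode once at the end to get a str
data = result.getvalue().decode("utf-8")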
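The text-mode failure can be reproduced without any PDF library, which may help show why binary mode matters. This is only a sketch under stated assumptions: the payload is a made-up stand-in for a PDF, and cp1252 is forced explicitly to mimic a Windows default encoding (the 'charmap' in the error message suggests a code page like that, but the actual default on the asker's machine is not known).

```python
import io
import os
import tempfile

# Hypothetical stand-in for a PDF: binary data containing byte 0x8f,
# which has no mapping in the cp1252 code page.
payload = b"%PDF-1.4\n\x8f\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode decodes while reading. cp1252 is forced here (an assumption)
    # so the behavior is the same on every platform.
    try:
        with open(path, encoding="cp1252") as fp:
            fp.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode returns the raw bytes untouched.
    with open(path, "rb") as fp:
        raw = fp.read()
finally:
    os.remove(path)

# BytesIO collects bytes; decode once at the end to get a str.
buffer = io.BytesIO()
buffer.write("<pages/>".encode("utf-8"))
data = buffer.getvalue().decode("utf-8")

print(type(text_mode_error).__name__)
print(raw == payload)
print(data)
```

The same split applies to the answer's code: bytes go in through a file opened with 'rb', bytes come out through BytesIO, and decoding happens exactly once on the final value.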
Upvotes: 4