Problem reading PDF to XML into memory using pdfminer.six

Question

Consider the following snippet:

import io
result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')

data = result.getvalue()

This results in the following error:

ValueError: Codec is required for a binary I/O output

If I leave out output_type, I get the following error instead:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>

I don't understand why this happens, and would like help with a workaround.

MennoK · Accepted Answer

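I figured out how to fix the problem. First, open "file.pdf" in binary mode: a PDF is a binary file, and opening it in text mode makes Python try to decode its bytes with the platform's default codec, which is where the UnicodeDecodeError comes from. Then, to read the output into memory, use BytesIO instead of StringIO and decode the result. For example: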

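import io

from pdfminer.high_level import extract_text_to_fp

# Collect the XML output in an in-memory bytes buffer.
result = io.BytesIO()
# PDFs are binary files, so open in 'rb' mode.
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')

# The buffer holds UTF-8 encoded bytes (the default codec); decode to str.
data = result.getvalue().decode("utf-8")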
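
To see why binary mode matters, here is a minimal sketch that reproduces the UnicodeDecodeError without pdfminer at all. It assumes the 'charmap' codec in the traceback is Windows cp1252 (the usual default there), and the sample bytes are a made-up stand-in for a real PDF, not an actual document.

```python
import os
import tempfile

# Hypothetical stand-in for a PDF: a header followed by a byte (0x8f)
# that has no mapping in the Windows cp1252 ("charmap") codec.
sample = b"%PDF-1.4\n\x8f\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Text mode: Python decodes the bytes while reading, which fails here.
    try:
        with open(path, encoding="cp1252") as fp:
            fp.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: the bytes come back untouched, no decoding involved.
    with open(path, "rb") as fp:
        raw = fp.read()
finally:
    os.remove(path)

print(type(text_mode_error).__name__)
print(raw == sample)
```

The text-mode read fails before any PDF parsing could happen, which is why the fix starts with open(..., 'rb') rather than with any pdfminer option.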
