user3207130

Reputation: 1

Maintaining Data Type in Colaboratory

I'm trying to use PyPDF2 to read a PDF document and output a plain text string. However, when I upload my PDF file to Colaboratory using the code:

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

it automatically converts the file to a str type rather than keeping it as an encoded string. Passing that to PyPDF2.PdfFileReader() raises an error, but if you print the string it still has all the encoded characters:

gsutilCheatSheet.pdf => %PDF-1.5 %���� 1 0 obj <>/Metadata 117 0 R/ViewerPreferences 118 0 R>> endobj

etc.

Is there any way to keep the imported document in its original encoded format, or is there another way to remove the encoding once it is already a str?

Upvotes: 0

Views: 460

Answers (1)

Bob Smith

Reputation: 38659

I suspect you need to wrap your uploaded file in an io.BytesIO.

Here's a complete example showing how to open an uploaded PDF using PyPDF2 -- https://colab.research.google.com/notebook#fileId=1XlmXcp4xnrUGMUArevxiGNlrbMOMECO1

The key bit is:

pdf = PdfFileReader(io.BytesIO(uploaded['abc123.pdf']))
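To show what the wrapping step does on its own, here is a minimal sketch that runs without Colab or PyPDF2. It assumes files.upload() hands back a dict mapping each filename to that file's raw bytes. The helper name as_file_object and the stand-in bytes are hypothetical and only there for illustration.

```python
import io


def as_file_object(uploaded, filename):
    """Wrap one uploaded file's raw bytes in a seekable in-memory stream.

    `uploaded` is assumed to be a dict of filename -> bytes, which is the
    shape files.upload() is assumed to return in Colab.
    """
    data = uploaded[filename]
    if not isinstance(data, bytes):
        raise TypeError('expected bytes for %r, got %s'
                        % (filename, type(data).__name__))
    return io.BytesIO(data)


# Stand-in for the result of files.upload(); not a real PDF.
uploaded = {'abc123.pdf': b'%PDF-1.5\n% stand-in bytes only\n'}

stream = as_file_object(uploaded, 'abc123.pdf')
header = stream.read(8)   # file-like objects support read() ...
stream.seek(0)            # ... and seek(), which a PDF reader relies on

# In Colab with PyPDF2 installed, the stream would then be passed on:
# pdf = PdfFileReader(stream)
```

The point is that PdfFileReader wants something it can read() and seek() on, such as an open file, and io.BytesIO gives the in-memory bytes that interface without writing anything to disk.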

Upvotes: 0
