Reputation: 1
I'm trying to use PyPDF2 to read a PDF document and output a plain text string. However, when I upload my PDF file to Colaboratory using the code:
from google.colab import files

uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
it automatically converts it to a str type rather than keeping it as an encoded string. This gives an error with PyPDF2.PdfFileReader(), but if you print the string it still has all the encoded characters:
gsutilCheatSheet.pdf => %PDF-1.5 %���� 1 0 obj <>/Metadata 117 0 R/ViewerPreferences 118 0 R>> endobj
etc.
Is there any way to keep the imported document in its original encoded format, or is there another way to remove the encoding once it is already a str?
Upvotes: 0
Views: 460
Reputation: 38659
I suspect you need to wrap your uploaded file in an io.BytesIO.
Here's a complete example showing how to open an uploaded PDF using PyPDF2 -- https://colab.research.google.com/notebook#fileId=1XlmXcp4xnrUGMUArevxiGNlrbMOMECO1
The key bit is:
pdf = PdfFileReader(io.BytesIO(uploaded['abc123.pdf']))
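To see why the wrapper helps, here is a minimal sketch using only the standard library. It assumes the uploaded value is a raw byte string; the `sample` bytes below are a made-up stand-in for `uploaded['abc123.pdf']`, not a real PDF. PdfFileReader expects a file-like object it can read from and seek in, and io.BytesIO provides that on top of in-memory bytes:

```python
import io

# Hypothetical stand-in for uploaded['abc123.pdf']: a few raw bytes
# that look like the start of a PDF file (not a complete document).
sample = b"%PDF-1.5\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"

# Wrap the bytes so they behave like a file opened in binary mode.
stream = io.BytesIO(sample)

# The wrapper supports the file operations a PDF parser relies on.
header = stream.read(8)   # read the first 8 bytes
stream.seek(0)            # rewind to the start, as a reader would

print(header)
print(stream.seekable())

# In Colab, with PyPDF2 installed, the same stream would be passed on:
# pdf = PdfFileReader(stream)
```

Passing the raw bytes directly would not work as a file, since a plain string argument is treated as a path to open rather than as the document contents.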
Upvotes: 0