Issue with PyPDF2 and decoding pdf file from S3

Question

I am trying to fetch a PDF file stored in one of my S3 buckets in AWS and read some of its metadata, such as the number of pages and the file size. I can retrieve the object from the bucket; calling print(obj) gives:

s3.Object(bucket_name='somebucketname', key='somefilename.pdf')

With PyPDF2.PdfFileReader() I have tried passing the raw file, a UTF-8 decoded file, and an ISO-8859-1 decoded file. The ISO-8859-1 decoded version is the only one that doesn't raise an exception straight away, but when I pass it to PdfFileReader I get this traceback:

Traceback (most recent call last):
  File "s3_test.py", line 18, in <module>
    pdfFile = PdfFileReader(parse3)
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1081, in __init__
    fileobj = open(stream, 'rb')
ValueError: embedded null byte

Am I using the wrong encoding to decode this PDF file, or is it something else, such as the first argument of PdfFileReader having to be a file path? Is there an easier way to access an S3 PDF object's metadata without jumping through hoops?

Python Script

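import boto3
from PyPDF2 import PdfFileReader

# Placeholder names, matching the print(obj) output above
bucket_name = 'somebucketname'
itemname = 'somefilename.pdf'

s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, itemname)
# Download the object body (bytes) and decode it to a str
parse3 = obj.get()['Body'].read().decode("ISO-8859-1")
# This is the call that raises "ValueError: embedded null byte"
pdfFile = PdfFileReader(parse3)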
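
One way to read the traceback: the decoded value is a str, and the `open(stream, 'rb')` frame suggests the string is being treated as a file path. The sketch below uses only the standard library to reproduce that failure and to show the usual alternative of wrapping the undecoded bytes in `io.BytesIO`. The `raw` bytes are a made-up stand-in for what `obj.get()['Body'].read()` returns, not a real PDF.

```python
import io

# Hypothetical stand-in for the bytes read from the S3 object body.
# Real PDF data typically contains NUL bytes inside its binary streams.
raw = b"%PDF-1.4\n\x00\x01binary stream data\n%%EOF\n"

# Decoding to str keeps the NUL character, so using the result as a
# path makes open() raise ValueError before it touches the filesystem.
as_text = raw.decode("ISO-8859-1")
error_message = None
try:
    open(as_text, "rb")
except ValueError as exc:
    error_message = str(exc)

# Wrapping the undecoded bytes gives a seekable, file-like object.
stream = io.BytesIO(raw)
header = stream.read(5)   # first bytes of the "file"
stream.seek(0)            # rewind so a reader can start from the top

# The size is simply the number of bytes that were downloaded.
size_in_bytes = len(raw)
```

If this reading is right, passing a `BytesIO` built from the undecoded body to `PdfFileReader`, which accepts file-like objects as well as paths, should avoid the decode step entirely.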
