Reputation: 3836
I am trying to fetch a PDF file stored in one of my S3 buckets in AWS and read some of its metadata, such as the number of pages and the file size. I can fetch the object from the bucket successfully; calling print(obj) gives
s3.Object(bucket_name='somebucketname', key='somefilename.pdf')
With PyPDF2.PdfFileReader() I tried passing the raw file, a UTF-8 decoded file, and an ISO-8859-1 decoded file. The ISO-8859-1 decode is the only one that doesn't raise an exception, but when I pass the result to PdfFileReader as a parameter I get an error with this traceback:
Traceback (most recent call last):
File "s3_test.py", line 18, in <module>
pdfFile = PdfFileReader(parse3)
File "/usr/local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1081, in __init__
fileobj = open(stream, 'rb')
ValueError: embedded null byte
Am I using the wrong encoding to decode this PDF file, or is it something else, such as the first argument of PdfFileReader having to be a file path? Is there an easier way to access an S3 PDF object's metadata without jumping through hoops?
Python Script
import boto3
from PyPDF2 import PdfReader
s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, itemname)
parse3 = obj.get()['Body'].read().decode("ISO-8859-1")
pdfFile = PdfReader(parse3)
Upvotes: 3
Views: 7854
Reputation: 103
Here's the overall strategy:
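PyPDF2 will be much smarter at determining how to decode the file than you will be, so don't decode the bytes yourself. PdfFileReader accepts either a file-like stream or a path to a file. A decoded string is treated as a path, which is why open() fails with "embedded null byte". Instead, read the object from S3, wrap the raw bytes in a byte stream, and let PdfFileReader do the hard work.
To turn the raw bytes into a byte stream, use the BytesIO class from the standard library's io module.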
Python 2 (2.6 and later) and Python 3 use the same import:
from io import BytesIO
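To see why the decoded string fails while a byte stream works, here is a minimal sketch that uses only the standard library. The `raw` value is a made-up stand-in for the bytes returned by S3 (it is not a valid PDF), and `try_open_as_path` is a hypothetical helper that mimics what the reader does when it receives a string.

```python
from io import BytesIO

# Stand-in for obj.get()["Body"].read(); illustrative bytes, not a real PDF.
raw = b"%PDF-1.4\n\x00\x01\x02 binary payload\n%%EOF"

# Decoding produces a str, which a reader will interpret as a file path.
as_text = raw.decode("ISO-8859-1")


def try_open_as_path(path):
    """Return the exception raised when `path` is opened as a file, or None."""
    try:
        with open(path, "rb"):
            return None
    except (ValueError, OSError) as exc:
        return exc


# The string contains a null byte, so it cannot be used as a path.
path_error = try_open_as_path(as_text)
print(type(path_error).__name__, path_error)

# Wrapping the undecoded bytes gives a seekable, binary file-like object.
stream = BytesIO(raw)
print(stream.read(8))
```

The same distinction applies to the PDF reader: hand it the `BytesIO` object, never the decoded text.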
For your code example:
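from io import BytesIO
import boto3
from PyPDF2 import PdfReader
s3 = boto3.resource("s3")
# bucket_name and itemname are the same variables as in the question
obj = s3.Object(bucket_name, itemname)
# Read the raw bytes; do not call .decode() on them
fs = obj.get()["Body"].read()
# Wrap the bytes in a file-like stream and let PyPDF2 parse it
reader = PdfReader(BytesIO(fs))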
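Since the question asks for the page count and file size, here is a rough sketch of how both could be pulled from the downloaded bytes. `pdf_metadata` is a hypothetical helper, not part of PyPDF2 or boto3, and its `reader_factory` parameter exists only so the sketch can run without either library installed. It assumes a PyPDF2 version that provides `PdfReader` with a `pages` property.

```python
from io import BytesIO


def pdf_metadata(pdf_bytes, reader_factory):
    """Return page count and byte size for a PDF held in memory.

    reader_factory is any callable that accepts a binary stream and returns
    an object with a `pages` sequence (assumed: PyPDF2's PdfReader).
    """
    reader = reader_factory(BytesIO(pdf_bytes))
    return {"pages": len(reader.pages), "size_bytes": len(pdf_bytes)}


# Assumed usage with the objects from the code above:
#   from PyPDF2 import PdfReader
#   info = pdf_metadata(obj.get()["Body"].read(), PdfReader)
#   print(info["pages"], info["size_bytes"])
```

If you only need the file size, the boto3 object should also expose it as obj.content_length, which avoids downloading the body at all.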
Upvotes: 8