TJB

Reputation: 3836

Issue with PyPDF2 and decoding pdf file from S3

I am trying to fetch a PDF file stored in one of my S3 buckets in AWS and read some of its metadata, such as the number of pages and the file size. Fetching the object works; calling print(obj) gives:

s3.Object(bucket_name='somebucketname', key='somefilename.pdf')

With PyPDF2.PdfFileReader() I tried the raw file, a UTF-8 decoded file, and an ISO-8859-1 decoded file. The ISO-8859-1 decoded file is the only one that doesn't raise an exception during decoding, but passing it to PdfFileReader as a parameter fails with this traceback:

Traceback (most recent call last):
  File "s3_test.py", line 18, in <module>
    pdfFile = PdfFileReader(parse3)
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1081, in __init__
    fileobj = open(stream, 'rb')
ValueError: embedded null byte

Am I using the wrong encoding to decode this PDF file, or is it something else, such as the first argument of PdfFileReader having to be a file path? Is there an easier way to access an S3 PDF object's metadata without jumping through hoops?

Python Script

import boto3
from PyPDF2 import PdfReader

s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, itemname)
parse3 = obj.get()['Body'].read().decode("ISO-8859-1")
pdfFile = PdfReader(parse3)

Upvotes: 3

Views: 7854

Answers (1)

Justin Leto

Reputation: 103

Here's the overall strategy:

  1. Let PyPDF2 handle the decoding

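A PDF is a binary format, so PyPDF2 is much better placed to decide how to decode it than you are. PdfFileReader accepts either a stream or a path to a file; a plain string is treated as a path, which is why your decoded string ends up in open() in the traceback. So read the object from S3 as bytes, prepare it as a byte stream, and let PdfFileReader do the hard work.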
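To see why the decoded string fails, here is a minimal sketch using only the standard library. It assumes, based on the open(stream, 'rb') line in the traceback, that a str argument is handled as a file path. The raw bytes are a made-up stand-in for a PDF body, not a valid PDF.

```python
from io import BytesIO

# Made-up stand-in for a PDF body; real PDFs are binary and often contain NUL bytes.
raw = b"%PDF-1.4\n\x00\x01binary payload\n%%EOF\n"

# ISO-8859-1 maps every byte to a character, so decoding never fails,
# but the result is a str, which a path-or-stream API treats as a file name.
as_text = raw.decode("ISO-8859-1")

try:
    open(as_text, "rb")
    error_message = None
except ValueError as exc:
    # The traceback in the question shows this as "embedded null byte".
    error_message = str(exc)

print(error_message)

# Wrapping the undecoded bytes gives a seekable, file-like object instead.
stream = BytesIO(raw)
header = stream.read(8)  # first eight bytes of the payload
stream.seek(0)           # rewind so a reader can start from the beginning
```

The decode step is what turns a usable payload into something that looks like a path, so the fix is to skip it entirely.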

  2. Prepare the byte stream

To pass the downloaded bytes to the reader as a stream, wrap them in a BytesIO object.

BytesIO lives in the io module on both Python 2 (2.6 and later) and Python 3:

from io import BytesIO

For your code example:

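from io import BytesIO

import boto3
from PyPDF2 import PdfReader


s3 = boto3.resource("s3")
obj = s3.Object(bucket_name, itemname)
# Read the raw bytes; do not decode them to a string.
pdf_bytes = obj.get()["Body"].read()
# Wrap the bytes in a file-like object so the reader treats them as a stream.
reader = PdfReader(BytesIO(pdf_bytes))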
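From there, the metadata you asked about could be collected along these lines. This is only a sketch: summarize_pdf and its reader_cls parameter are hypothetical names of mine, and it assumes the reader exposes a pages sequence, as PdfReader does in recent PyPDF2 releases.

```python
from io import BytesIO


def summarize_pdf(data, reader_cls):
    """Return page count and size for PDF bytes already downloaded from S3.

    data:       the undecoded bytes of the PDF.
    reader_cls: a reader class such as PyPDF2's PdfReader (passed in so the
                helper itself has no third-party imports).
    """
    reader = reader_cls(BytesIO(data))
    return {
        "pages": len(reader.pages),  # assumes a .pages sequence on the reader
        "size_bytes": len(data),     # size of the downloaded body in bytes
    }


# Hypothetical usage with the objects from the example above:
#   from PyPDF2 import PdfReader
#   info = summarize_pdf(obj.get()["Body"].read(), PdfReader)
```

If you only need the size, the boto3 Object resource also has a content_length attribute, which as far as I know avoids downloading the body at all.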

Upvotes: 8
