user2907249

Reputation: 879

Read pdf object from S3

I am trying to create a Lambda function that reads a PDF form uploaded to S3, extracts the data entered into the form, and sends it elsewhere.

I can do this when I download the file locally. The script below works and reads the data from the PDF into a pandas DataFrame:

import boto3
import PyPDF2 as pypdf
import pandas as pd

s3 = boto3.resource('s3')
s3.meta.client.download_file(bucket_name, asset_key, './target.pdf')

pdfobject = open("./target.pdf", 'rb')
pdf = pypdf.PdfFileReader(pdfobject)
data = pdf.getFormTextFields()

pdf_df = pd.DataFrame(data, columns=get_cols(data), index=[0])

But in Lambda I cannot save the file locally, because I get a "read only filesystem" error.

I have tried using the client's get_object() method like below:

s3 = boto3.client('s3')  # get_object() is a client method, not a resource method
s3_response_object = s3.get_object(
    Bucket='pdf-forms-bucket',
    Key='target.pdf',
)

pdf_bytes = s3_response_object['Body'].read()

But I have no idea how to convert the resulting bytes into an object that PyPDF2 can parse. The output that I need, and that PyPDF2 produces, looks like this:

{'form1[0].#subform[0].nameandmail[0]': 'Burt Lancaster',
 'form1[0].#subform[0].mailaddress[0]': '675 Creighton Ave, Washington DC',
 'form1[0].#subform[0].Principal[0]': 'David St. Hubbins',
 'Principal[1]': None,
 'form1[0].#subform[0].Principal[2]': 'Bart Simpson',
 'Principal[3]': None}

So in summary, I need to be able to read a PDF with fillable form fields into memory and parse it without downloading the file, because my Lambda function environment won't allow local temp files.

Upvotes: 3

Views: 21531

Answers (2)

Mayur Ghadge

Reputation: 153

Thanks @Harrison for the previous solution. Here is an alternative:

import boto3
from io import BytesIO
from PyPDF2 import PdfReader

s3 = boto3.resource(service_name="s3",
                    region_name="your_region_name",
                    aws_access_key_id="your_key_id",
                    aws_secret_access_key="your_key")


obj = s3.Bucket('your_bucket_name').Object('file_key').get()

reader = PdfReader(BytesIO(obj['Body'].read()))

for page in reader.pages:
    print(f"Text: {page.extract_text()}")

Upvotes: 1

user2907249

Reputation: 879

Edit (March 2024)

There seems to have been a change in the libraries since the original answer (PyPDF2 3.x removed PdfFileReader in favor of PdfReader). Here is the latest solution:

from io import BytesIO
import boto3
from PyPDF2 import PdfReader

s3 = boto3.client("s3")
pdf_file = s3.get_object(Bucket="***", Key="***")["Body"].read()
reader = PdfReader(BytesIO(pdf_file))

for page in reader.pages:
    print(f"Text: {page.extract_text()}")
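
The loop above prints page text, while the question asks for the form field values. A minimal sketch of the same in-memory approach that returns the fields instead, assuming PyPDF2 2.x or later (where PdfReader exposes get_form_text_fields()). The helper names fetch_pdf_stream and read_form_text_fields are made up for illustration, and the bucket and key in the usage comment are placeholders:

```python
from io import BytesIO


def fetch_pdf_stream(s3_client, bucket, key):
    """Return the S3 object's bytes wrapped in a seekable in-memory file."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    # PdfReader needs a seekable stream; the raw S3 body is not, so buffer it.
    return BytesIO(body.read())


def read_form_text_fields(pdf_stream):
    """Parse the stream and return {field name: value} for the text form fields."""
    # Imported here so the S3 helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0

    return PdfReader(pdf_stream).get_form_text_fields()


# Usage (placeholder bucket and key):
#   import boto3
#   stream = fetch_pdf_stream(boto3.client("s3"), "pdf-forms-bucket", "form.pdf")
#   data = read_form_text_fields(stream)
```

Passing the client in as an argument keeps the S3 call separate from the parsing step, so either half can be swapped or stubbed independently.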

Previous Answer:

Solved, this does the trick:

import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO

bucket_name = "pdf-forms-bucket"
item_name = "form.pdf"


s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, item_name)
fs = obj.get()['Body'].read()
pdf = PdfFileReader(BytesIO(fs))

data = pdf.getFormTextFields()
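
If some library does need a real file path, Lambda's /tmp directory is writable even though the rest of the filesystem is read-only, so the "read only filesystem" error from the question can be avoided by downloading there. A minimal sketch under that assumption; the helper name download_to_tmp is hypothetical:

```python
import os
import tempfile


def download_to_tmp(s3_client, bucket, key, tmp_dir="/tmp"):
    """Download an S3 object into tmp_dir and return the local file path."""
    # mkstemp picks a unique name, so reused containers don't overwrite old files.
    fd, path = tempfile.mkstemp(suffix=".pdf", dir=tmp_dir)
    os.close(fd)  # boto3 opens the path itself, so release our handle first
    s3_client.download_file(bucket, key, path)
    return path
```

Files in /tmp can persist across warm invocations, so it is probably worth deleting the file with os.remove(path) once it has been parsed.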

Upvotes: 11
