PyPDF2 extractText() not working when embedded in loop inside AWS Lambda function

Question

I am calling a Lambda function through API Gateway. The user passes in a PDF, which is read as bytes and sent to the Lambda function. The function breaks while looping over the pages, at the line "page_text = page_obj.extractText()". Here is the code for the Lambda function:

import base64
import io
import json

import PyPDF2


def lambda_handler(event, context):
    # Extract event data: the PDF arrives base64-encoded in the request
    file_content = event["content"]
    decode_content = base64.b64decode(file_content)
    read_bytes = io.BytesIO(decode_content)

    # Read file into PyPDF2
    bill_reader = PyPDF2.PdfReader(read_bytes)

    # Get Number of Pages in Bill
    bill_pages = len(bill_reader.pages)
    print(bill_pages)

    # Get Text from Bill
    bill_text = []
    for page_num in range(bill_pages):
        page_obj = bill_reader.pages[page_num]
        page_text = page_obj.extractText()  # Lambda fcn breaks here
        bill_text.append(page_text)
    bill_text = ''.join(bill_text)

    # There is more code after this to automatically summarize the text...

    return {
        'statusCode': 200,
        'body': json.dumps(bill_text)
    }

I am not sure exactly what happens, but the function seems to stall. The weird thing is that this code works when I run it locally, and it works just fine outside the loop if I manually specify every single page to extract (i.e. page_obj = bill_reader.pages[0], then page_text = page_obj.extractText()).
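
One way to narrow down where the function stalls is to time each page and log any per-page error, so the CloudWatch output shows the last page that completed. The following is only a minimal sketch under stated assumptions: extract_pages is a hypothetical helper, not part of PyPDF2, and the extraction call is passed in as a callable so the helper does not depend on any particular PyPDF2 version.

```python
import time


def extract_pages(pages, extract, log=print):
    """Extract text page by page, logging timing and any per-page error.

    `pages` is any iterable of page objects and `extract` is a callable
    that returns the text of one page (both are supplied by the caller).
    """
    texts = []
    for page_num, page in enumerate(pages):
        start = time.perf_counter()
        try:
            # Treat a None result as empty text
            text = extract(page) or ""
        except Exception as exc:
            # Record the failure and continue with the remaining pages
            log(f"page {page_num}: failed with {exc!r}")
            text = ""
        elapsed = time.perf_counter() - start
        log(f"page {page_num}: {len(text)} chars in {elapsed:.3f}s")
        texts.append(text)
    return "".join(texts)
```

In the handler above it could be called as extract_pages(bill_reader.pages, lambda p: p.extractText()). If the log stops partway through without an error, one possibility worth ruling out is that the function is reaching its configured Lambda timeout (the default is 3 seconds) rather than failing inside PyPDF2.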
