How to extract and combine text and tables from PDF using AWS Textract

Question

I am using the textractor package to extract the text and a table from a PDF document through AWS Textract:

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(region_name='us-east-1')
document = extractor.start_document_analysis(
    file_source="s3://.pdf",
    features=[TextractFeatures.TABLES],
)

text = document.document.pages[0].text
table_csv = document.document.pages[0].tables[0].to_csv()

This works well. However, I want to combine (1) the text of the page and (2) the table on the page into a single text string, WITHOUT overlapping text. Right now, the text variable also contains the text that appears in table_csv, so simply concatenating the two strings duplicates information.

Is there a clean way to remove the overlapping text to achieve this?
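
One way to think about the de-duplication, sketched without any textractor calls: Textract reports a bounding box for each detected line and table, so lines whose center falls inside a table's box can be dropped before the table's CSV is spliced in at the same vertical position. The Box class, the combine_text_and_tables function, and the sample data below are hypothetical and only illustrate the idea; how the (text, box) pairs are read from the textractor page and table objects is an assumption that depends on the library version.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box: top-left corner plus size (hypothetical helper)."""
    x: float
    y: float
    width: float
    height: float

    def contains_center_of(self, other: "Box") -> bool:
        # A line counts as "inside" when its center point lies in this box.
        cx = other.x + other.width / 2
        cy = other.y + other.height / 2
        return (self.x <= cx <= self.x + self.width
                and self.y <= cy <= self.y + self.height)


def combine_text_and_tables(lines, tables):
    """lines: list of (text, Box); tables: list of (csv_text, Box).

    Drops every line whose center falls inside a table box, then emits the
    remaining lines and the tables in top-to-bottom order.
    """
    table_boxes = [box for _, box in tables]
    kept = [
        (box.y, text)
        for text, box in lines
        if not any(t.contains_center_of(box) for t in table_boxes)
    ]
    blocks = kept + [(box.y, csv_text.strip()) for csv_text, box in tables]
    blocks.sort(key=lambda item: item[0])  # order by vertical position
    return "\n".join(text for _, text in blocks)


# Made-up sample data standing in for one page of Textract output.
lines = [
    ("Quarterly report", Box(0.1, 0.05, 0.5, 0.03)),
    ("Item Price", Box(0.1, 0.20, 0.3, 0.03)),
    ("Apple 1.20", Box(0.1, 0.24, 0.3, 0.03)),
    ("End of page", Box(0.1, 0.60, 0.3, 0.03)),
]
tables = [("Item,Price\nApple,1.20\n", Box(0.08, 0.18, 0.5, 0.12))]

combined = combine_text_and_tables(lines, tables)
print(combined)
```

The center-point test is a deliberate simplification: it tolerates lines that slightly overhang the table border, but it ignores multi-column layouts, where sorting by vertical position alone would interleave the columns.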
