kahlo

Reputation: 2374

How to extract and combine text and tables from PDF using AWS Textract

I am using the textractor package to extract the text and the tables from a PDF document with AWS Textract:

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(region_name='us-east-1')
document = extractor.start_document_analysis(
    file_source="s3://<document>.pdf",
    features=[TextractFeatures.TABLES],
)

text = document.document.pages[0].text
table_csv = document.document.pages[0].tables[0].to_csv() 

This works well. However, I want to combine into a single string (1) the text of the page and (2) the table on the page, WITHOUT overlapping text. Right now, the text variable already contains the text that also appears in table_csv, so simply concatenating the two strings duplicates the table content.

Is there a clean way to remove the overlapping text to achieve this?

Upvotes: 0

Views: 1389

Answers (1)

Thomas

Reputation: 701

Yes. You can get the page text with the tables rendered inline in CSV form by passing a suitable linearization configuration to get_text. Because each table is emitted once, at its position in the reading order, nothing is duplicated.

Your code:

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(region_name='us-east-1')
document = extractor.start_document_analysis(
    file_source="s3://<document>.pdf",
    features=[TextractFeatures.TABLES],
)

followed by:

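from textractor.data.text_linearization_config import TextLinearizationConfig

# Render table cells separated by commas and table rows separated by newlines
text = document.get_text(TextLinearizationConfig(table_column_separator=",", table_row_separator="\n"))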

print(text)
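
To illustrate the idea without calling AWS, here is a minimal standalone sketch of what "linearizing" a page means: walk the layout blocks in reading order and emit each table once, as CSV, in place. The block structure and the names linearize and table_to_csv are hypothetical and made up for this example; this is not textractor's implementation. It uses Python's csv module, which quotes cells that contain commas. I would not assume a plain separator join does that, so check the output if your cells can contain commas.

```python
import csv
import io

def table_to_csv(rows):
    # Serialize a list of rows to CSV; cells containing commas get quoted.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().rstrip("\n")

def linearize(blocks):
    # Hypothetical layout model: (kind, content) pairs in reading order.
    # Each table is rendered once, in place, so its text is never repeated.
    parts = []
    for kind, content in blocks:
        if kind == "table":
            parts.append(table_to_csv(content))
        else:
            parts.append(content)
    return "\n".join(parts)

# Made-up sample page: a heading, one table, and a trailing paragraph.
page = [
    ("text", "Quarterly summary"),
    ("table", [["Region", "Sales"], ["North", "1,200"]]),
    ("text", "Figures are unaudited."),
]

combined = linearize(page)
print(combined)
```

The table content appears exactly once in combined, between the surrounding text blocks, which is the behaviour asked for in the question.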

Upvotes: 0
