Reputation: 2374
I am using the textractor package to extract the text and the table that is present in a pdf document through AWS Textract:
from textractor import Textractor
from textractor.data.constants import TextractFeatures
extractor = Textractor(region_name='us-east-1')
document = extractor.start_document_analysis(
    file_source="s3://<document>.pdf",
    features=[TextractFeatures.TABLES],
)
text = document.document.pages[0].text
table_csv = document.document.pages[0].tables[0].to_csv()
This works well. However, I want to combine, in a single text string, (1) the text of the page and (2) the table on the page, WITHOUT overlapping text. Right now, the text variable also contains the text extracted from the table, which is already in table_csv. If I just concatenate the strings, the information will be duplicated.
Is there a clean way to remove the overlapping text to achieve this?
Upvotes: 0
Views: 1389
Reputation: 701
Yes. You can get the text of the page with the tables rendered inline in CSV format by passing the appropriate linearization configuration to get_text.
Your code:
from textractor import Textractor
from textractor.data.constants import TextractFeatures
extractor = Textractor(region_name='us-east-1')
document = extractor.start_document_analysis(
    file_source="s3://<document>.pdf",
    features=[TextractFeatures.TABLES],
)
followed by:
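from textractor.data.text_linearization_config import TextLinearizationConfig
# Render each table inline as comma-separated rows instead of as free text
text = document.get_text(TextLinearizationConfig(table_column_separator=",", table_row_separator="\n"))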
print(text)
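To make the idea concrete without calling AWS, here is a toy sketch of what "linearizing" a page means: walk the page blocks in reading order and render each table once, in place, with the chosen separators. This is not Textractor's implementation; linearize_page, page_blocks and the sample data are hypothetical names made up for illustration.

```python
# Toy illustration only: plain Python data standing in for a parsed page.
# None of these names come from the textractor package.

def linearize_page(blocks, column_separator=",", row_separator="\n"):
    """Join page blocks in reading order, rendering each table once, in place."""
    parts = []
    for kind, content in blocks:
        if kind == "table":
            # One output line per table row, cells joined by the column separator
            rows = [column_separator.join(row) for row in content]
            parts.append(row_separator.join(rows))
        else:
            parts.append(content)
    return "\n".join(parts)


# Hypothetical page: a heading, a table, and a footnote
page_blocks = [
    ("text", "Quarterly results"),
    ("table", [["Region", "Sales"], ["North", "120"], ["South", "95"]]),
    ("text", "Figures are in thousands."),
]

combined = linearize_page(page_blocks)
print(combined)
```

Because the table is emitted where it sits on the page rather than appended afterwards, its cell text shows up a single time. Note that this sketch joins cells naively and does no CSV quoting, so cells that themselves contain commas would need extra handling.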
Upvotes: 0