Joey Morrow

Reputation: 351

How to convert a human-readable timeline to table using existing ML tools?

I have this timeline from a newspaper produced by my Native American tribe. I tried to use AWS Textract to produce some kind of table from it, but Textract does not recognize any tables in it, so I don't think that will work (perhaps more is possible if I pay, but it doesn't say so).

Ultimately, I am trying to sift through all the archived newspapers and download the timelines for all of our election cycles (both "general" and "special advisory") to find the number of days between each item in each timeline.

Since this is all in the public domain, I see no reason I can't paste a picture of the table here. I will include the download URL for the document as well.

Download URL: Download

I started off by using Foxit Reader on individual documents to find the timelines on Windows.

Then I used the tool 'ocrmypdf' on Ubuntu to ensure all these documents are searchable (ocrmypdf --skip-text Notice_of_Special_Election_2023.pdf.pdf ./output/Notice_of_Special_Election_2023.pdf).

Then I happened to see an ad for AWS Textract this morning in my Google Newsfeed and saw how powerful it is. But when I tried it, it didn't actually find these human-readable timelines.

I'm wondering whether any ML tools, or even other solutions, exist for this type of problem.

I am mainly trying to keep my tech knack up to par. I was sick the last two years, and this is a fun problem to tackle that I think is pretty fringe.

Upvotes: 1

Views: 70

Answers (1)

Thomas

Reputation: 701

Actually, it seems that Textract does detect this table. You can use the amazon-textract-textractor package to simplify calling Textract and working with its response.

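from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Uses the credentials of the "default" AWS CLI profile
extractor = Textractor(profile_name="default")

# Asynchronous analysis: the PDF is uploaded to S3 and Textract looks for tables
document1 = extractor.start_document_analysis(
    file_source='./Notice_of_Special_Election_2023.pdf',
    features=[TextractFeatures.TABLES],
    s3_upload_path='<YOUR_S3_BUCKET>',
    s3_output_path='<YOUR_S3_BUCKET>',
    save_image=True,
)
# Draw the detected table on the first page, without per-word boxes
document1.pages[0].visualize(with_words=False)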

table result

You can then get the extracted table as a pandas DataFrame like this:

document1.tables[0].to_pandas()

data frame
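
Once the table is a DataFrame, the number of days between timeline items (the goal stated in the question) could be computed along these lines. This is only a sketch: the column names "Date" and "Event", the helper name days_between_events, and the sample rows are all made up for illustration, so inspect the real output of to_pandas() first and adjust the column name to whatever it actually contains.

```python
import pandas as pd


def days_between_events(table: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Sort rows by date and add the gap in days to the previous row.

    `date_col` is a hypothetical column name; use the one in your table.
    """
    out = table.copy()
    # errors="coerce" turns cells that are not dates (e.g. header text) into NaT
    out[date_col] = pd.to_datetime(
        out[date_col].astype(str).str.strip(), errors="coerce"
    )
    # Drop rows without a usable date, then order the timeline chronologically
    out = out.dropna(subset=[date_col]).sort_values(date_col).reset_index(drop=True)
    # Difference between consecutive dates, expressed in whole days
    out["days_since_previous"] = out[date_col].diff().dt.days
    return out


# Made-up rows standing in for document1.tables[0].to_pandas()
sample = pd.DataFrame(
    {
        "Date": ["January 5, 2023", "January 19, 2023", "not a date", "February 2, 2023"],
        "Event": ["Notice posted", "Filing deadline", "stray header row", "Election day"],
    }
)
result = days_between_events(sample)
print(result)
```

The first row has no previous item, so its gap is NaN. Running the same function over the table from each newspaper would give one set of gaps per election cycle to compare.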

Upvotes: 1
