Reputation: 8869
I am using the Amazon Textract API, through AWS' Python API, to extract text from a document (pdf or jpg). I do get the text and coordinates of its bounding box, but I would also love to have the font type (only the major ones needed: Arial, Helvetica, Verdana, Calibri, Times New Roman + a few others).
Does anyone have a solution to get that piece of data?
The ideal solution would be a package that accepts a small image, returns the font name, and can run on my own server. An external API would most likely be too costly in both money and time, as I need to run it more than 100 times per second.
{'BlockType': 'LINE',
 'Confidence': 99.81985473632812,
 'Text': 'This is a text',
 'Geometry': {'BoundingBox': {'Width': 0.7395017743110657,
                              'Height': 0.012546566314995289,
                              'Left': 0.12995509803295135,
                              'Top': 0.2536422610282898},
              'Polygon': [{'X': 0.12995509803295135, 'Y': 0.2536422610282898},
                          {'X': 0.8694568872451782, 'Y': 0.2536422610282898},
                          {'X': 0.8694568872451782, 'Y': 0.2661888301372528},
                          {'X': 0.12995509803295135, 'Y': 0.2661888301372528}]},
 'Id': '59f42615-7f33-41d2-9f3c-77ae5e4b6e7a',
 'Relationships': ...}
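The BoundingBox values above are ratios of the page width and height, not pixels, so cropping the line or computing its true aspect ratio needs the page's pixel dimensions. Here is a minimal sketch of that conversion; the helper names `to_pixel_box` and `pixel_aspect_ratio` are hypothetical, and the page size is something you have to supply yourself from the rendered image.

```python
def to_pixel_box(bounding_box, page_width, page_height):
    """Convert a normalized Textract BoundingBox to (left, top, right, bottom) in pixels."""
    left = bounding_box['Left'] * page_width
    top = bounding_box['Top'] * page_height
    right = left + bounding_box['Width'] * page_width
    bottom = top + bounding_box['Height'] * page_height
    return (round(left), round(top), round(right), round(bottom))


def pixel_aspect_ratio(bounding_box, page_width, page_height):
    """Width/height of the box in pixels; differs from Width/Height unless the page is square."""
    pixel_width = bounding_box['Width'] * page_width
    pixel_height = bounding_box['Height'] * page_height
    return pixel_width / pixel_height
```

The 4-tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be used to cut out the small image of a single line.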
I implemented a heuristic that computes the width/height ratio of the text's bounding box, renders the same text in different fonts using Python's Pillow package, and picks the font whose rendered ratio is closest. However, this heuristic often gives wrong results.
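For readers who want to reproduce the ratio heuristic, it could look roughly like the sketch below. This is an illustration under assumptions, not the actual code from the question: `closest_font` and `rendered_ratio` are hypothetical names, and the font file paths must point to TrueType files installed on your machine. One likely reason it misfires is that some of the target fonts have near-identical character widths (Arial was designed to be metric-compatible with Helvetica), so a single ratio cannot separate them.

```python
def closest_font(observed_ratio, candidate_ratios):
    """Return the font name whose rendered width/height ratio is nearest the observed one."""
    return min(candidate_ratios,
               key=lambda name: abs(candidate_ratios[name] - observed_ratio))


def rendered_ratio(text, font_path, size=64):
    """Measure width/height of `text` drawn with the TrueType font at `font_path`."""
    # Imported lazily so closest_font stays usable without Pillow installed.
    from PIL import ImageFont
    font = ImageFont.truetype(font_path, size)
    left, top, right, bottom = font.getbbox(text)  # requires Pillow >= 8.0
    return (right - left) / (bottom - top)
```

In use, one would build `candidate_ratios` by calling `rendered_ratio` once per font file for the recognized text, then pass the pixel aspect ratio of the Textract box as `observed_ratio`.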
Upvotes: 4
Views: 5971
Reputation: 151
At the moment, Amazon Textract does not support font recognition. These two projects might help you:
Upvotes: 5