Reputation: 25
I need to add a feature to my app that lets my clients extract text from scanned documents (images of text), parse it into usable data such as JSON, and store it so they can search the data more effectively.
These documents are large PDF files (~150-500 pages), and my clients want to upload many of them, because for now they have to find the data they need by manually reading every PDF.
For now I'm considering Google Cloud's Document AI API, which seems to do exactly what I need quite easily, especially combined with the Document AI Warehouse API. But I have heard here and there that the OCR quality of Document AI may not be reliable. Do you have any feedback on that? Or is there another way to do what I want?
Upvotes: 0
Views: 332
Reputation: 2234
Document AI Warehouse is specifically designed to integrate with Document AI to store, search, filter, and manage documents and their related structured data. So it could be an option to try if you want a fully managed service for document management.
Document AI has processors designed for Optical Character Recognition, as well as entity extraction processors that pull named entities from specific document types. You can try some processors using this demo and see how well the OCR works on your specific documents. OCR quality will vary with the quality of the scanned documents and with the specific model version you use. You can follow this guide about document scan resolution for general guidance on input document quality, and the Document OCR processor outputs image quality analysis information in addition to the OCR text. (TL;DR: it depends)
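Since the answer comes down to "test it on your own documents", one way to make that test concrete is to summarize the confidence values in a processor's output per page and flag the pages that look weak. The snippet below is only a rough sketch: it works on a plain dictionary, and the field names (`pages`, `tokens`, `layout.confidence`, `imageQualityScores.qualityScore`) are my assumption of what the JSON form of a Document AI result looks like, so check them against the output you actually get. The function name and the 0.9 threshold are made up for illustration.

```python
from statistics import mean


def page_confidence_report(document: dict, threshold: float = 0.9) -> list:
    """Summarize OCR confidence per page and flag pages for manual review.

    `document` is assumed to be a dict shaped like the JSON form of an OCR
    result; the field names used here are assumptions, not a verified schema.
    """
    report = []
    for number, page in enumerate(document.get("pages", []), start=1):
        # Collect the confidence of every token that reports one.
        confidences = [
            token["layout"]["confidence"]
            for token in page.get("tokens", [])
            if "confidence" in token.get("layout", {})
        ]
        average = mean(confidences) if confidences else None
        # Optional page-level image quality score, if the processor returned one.
        quality = page.get("imageQualityScores", {}).get("qualityScore")
        report.append(
            {
                "page": number,
                "mean_token_confidence": average,
                "image_quality": quality,
                # Pages with no tokens at all are flagged as well.
                "needs_review": average is None or average < threshold,
            }
        )
    return report


# Hand-written sample data, not real processor output.
sample_document = {
    "pages": [
        {
            "imageQualityScores": {"qualityScore": 0.95},
            "tokens": [
                {"layout": {"confidence": 0.99}},
                {"layout": {"confidence": 0.97}},
            ],
        },
        {
            "imageQualityScores": {"qualityScore": 0.41},
            "tokens": [
                {"layout": {"confidence": 0.62}},
                {"layout": {"confidence": 0.70}},
            ],
        },
    ]
}

for row in page_confidence_report(sample_document):
    print(row)
```

Running something like this over a handful of representative PDFs should give you a feel for how many pages would need a human check before you commit to the service.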
Upvotes: 0