Reputation: 6052
I am currently working on a project where I need to dynamically classify incoming documents. These documents can be text-based PDFs as well as scanned PDFs.
I have the following labels: invoice, packing list, and certificate.
I am trying to figure out how I should approach this problem.
I was thinking the best way to solve this would be to perform text classification based on the extracted document text.
Step 1 - Train a text classification model on the extracted document text
Step 2 - Use the model to classify new incoming documents
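Roughly what I have in mind, as a sketch (assuming pdfminer.six for text PDFs, pdf2image + pytesseract for scanned ones, and scikit-learn for the classifier; the file names and labels are placeholders):

```python
# Rough sketch of the text-classification idea (Step 1 and Step 2).
# Assumes: pdfminer.six for text PDFs, pdf2image + pytesseract for scanned PDFs,
# scikit-learn for the model. Paths and labels below are placeholders.
from pdfminer.high_level import extract_text
from pdf2image import convert_from_path
import pytesseract

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def pdf_to_text(path):
    """Try the embedded text layer first; fall back to OCR for scanned PDFs."""
    text = extract_text(path)
    if text.strip():
        return text
    pages = convert_from_path(path)  # render each page to an image
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


# Step 1 - train on labelled example documents (placeholders)
train_files = ["invoice_01.pdf", "packing_list_01.pdf", "certificate_01.pdf"]
train_labels = ["invoice", "packing_list", "certificate"]

model = make_pipeline(TfidfVectorizer(max_features=20000),
                      LogisticRegression(max_iter=1000))
model.fit([pdf_to_text(f) for f in train_files], train_labels)

# Step 2 - classify a new incoming document
print(model.predict([pdf_to_text("new_document.pdf")]))
```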
Is there another way to do this? My concern is that I am not sure whether you can perform NLP on entire documents. Maybe object detection (computer vision) is needed instead?
Upvotes: -1
Views: 1729
Reputation:
I understand your problem. Some key points about it:
a) First, do pre-processing of the input data, e.g. work out how many pages each invoice or certificate PDF has, then convert the PDF into TIFF images.
b) Train a model using image, visual/layout, and text features; you will get good accuracy.
c) You can use computer vision and deep learning (Keras and TensorFlow).
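As a rough sketch of step a), assuming pdf2image (which needs poppler installed) for the conversion; paths and DPI are placeholders:

```python
# Minimal sketch of step a): convert each PDF page to a TIFF image.
# Assumes pdf2image (poppler) is installed; paths and DPI are placeholders.
import os
from pdf2image import convert_from_path


def pdf_to_tiff(pdf_path, out_dir, dpi=200):
    """Render every page of the PDF and save it as a TIFF file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi)):
        out_path = os.path.join(out_dir, f"page_{i:03d}.tiff")
        page.save(out_path, "TIFF")
        paths.append(out_path)
    return paths


print(pdf_to_tiff("invoice_01.pdf", "invoice_01_pages"))
```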
Upvotes: 0
Reputation: 1882
Computer vision would be faster and would be my first choice for your use case. Are the three types of documents visually different in terms of layout? Certificates probably have a different "look" and layout, but packing lists and invoices may look similar. You would want to convert the PDFs into page images, then train and run an image classification model. You should use transfer learning on a pre-trained image classification model like ResNet.
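As a rough sketch of that image-classification route, assuming Keras/TensorFlow with a pre-trained ResNet50 and page images already exported into one folder per class (folder names, image size, and epochs are placeholders):

```python
# Sketch of transfer learning on page images with a pre-trained ResNet50 (Keras).
# Assumes page images are sorted into data/train/<class_name>/ folders;
# all hyperparameters below are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=16
)
num_classes = len(train_ds.class_names)
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))

base = ResNet50(weights="imagenet", include_top=False,
                pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained backbone

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```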
You can perform NLP on "entire documents", but it works best on prose text rather than the text of invoices or packing lists. You can look up sentence embedding models (InferSent, Google USE, BERT) that can actually be used to classify full-page text and not just single sentences, although some of them can be computationally expensive.
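For the NLP route, here is a rough sketch that embeds the full page text with a BERT-based sentence-embedding model (the sentence-transformers package is just one option among those listed) and trains a light classifier on top; the model name, texts, and labels are placeholders:

```python
# Sketch of the NLP route: embed the full page text with a BERT-based
# sentence-embedding model and train a light classifier on top.
# Assumes the sentence-transformers package; texts and labels are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

train_texts = ["Invoice No. 1234 ... total amount due ...",
               "Packing list ... quantity ... gross weight ...",
               "Certificate of origin ... hereby certifies ..."]
train_labels = ["invoice", "packing_list", "certificate"]

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

# Classify the extracted text of a new page
new_page_text = "Invoice number 5678, payment due within 30 days ..."
print(clf.predict(encoder.encode([new_page_text])))
```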
Upvotes: 1