Reputation: 6052
I am currently working on a project where I need to dynamically classify incoming documents. These documents can be text-based PDFs as well as scanned PDFs.
I have the following labels: invoice, packing list, and certificate.
I am trying to figure out how I should approach this problem.
I was thinking the best way to solve this would be to perform text classification based on the extracted document text.
Step 1 - Train a text classification model on the extracted document text
Step 2 - Use the model to classify new incoming documents
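Roughly what I have in mind, as a sketch (assuming pdfminer.six for text PDFs, pdf2image + pytesseract for scanned ones, and scikit-learn for the classifier; the file names and labels are placeholders):

```python
# Rough sketch of the text-classification idea (Step 1 and Step 2).
# Assumes: pdfminer.six for text PDFs, pdf2image + pytesseract for scanned PDFs,
# scikit-learn for the model. Paths and labels below are placeholders.
from pdfminer.high_level import extract_text
from pdf2image import convert_from_path
import pytesseract

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def pdf_to_text(path):
    """Try the embedded text layer first; fall back to OCR for scanned PDFs."""
    text = extract_text(path)
    if text.strip():
        return text
    pages = convert_from_path(path)  # render each page to an image
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


# Step 1 - train on labelled example documents (placeholders)
train_files = ["invoice_01.pdf", "packing_list_01.pdf", "certificate_01.pdf"]
train_labels = ["invoice", "packing_list", "certificate"]

model = make_pipeline(TfidfVectorizer(max_features=20000),
                      LogisticRegression(max_iter=1000))
model.fit([pdf_to_text(f) for f in train_files], train_labels)

# Step 2 - classify a new incoming document
print(model.predict([pdf_to_text("new_document.pdf")]))
```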
Is there another way to do this? My concern is that I am not sure whether you can perform NLP on entire documents. Maybe object detection (computer vision) is needed instead?
Upvotes: -1
Views: 1729
Reputation:
I understand your problem. Some key points about it:
a) First, do pre-processing of the input data, e.g. work out how many pages each invoice or certificate PDF has, then convert the PDF into TIFF images.
b) Train a model using image, visual/layout, and text features; you will get good accuracy.
c) You can use computer vision and deep learning (Keras and TensorFlow).
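As a rough sketch of step a), assuming pdf2image (which needs poppler installed) for the conversion; paths and DPI are placeholders:

```python
# Minimal sketch of step a): convert each PDF page to a TIFF image.
# Assumes pdf2image (poppler) is installed; paths and DPI are placeholders.
import os
from pdf2image import convert_from_path


def pdf_to_tiff(pdf_path, out_dir, dpi=200):
    """Render every page of the PDF and save it as a TIFF file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi)):
        out_path = os.path.join(out_dir, f"page_{i:03d}.tiff")
        page.save(out_path, "TIFF")
        paths.append(out_path)
    return paths


print(pdf_to_tiff("invoice_01.pdf", "invoice_01_pages"))
```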
Upvotes: 0
Reputation: 1882
Computer vision would be faster and would be my first choice for your use case. Are the three types of documents visually different in terms of layout? Certificates probably have a different "look" and layout, but packing lists and invoices may look similar. You would want to convert the PDFs into page images, then train and run an image classification model. You should use transfer learning on a pre-trained image classification model like ResNet.
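As a rough sketch of that image-classification route, assuming Keras/TensorFlow with a pre-trained ResNet50 and page images already exported into one folder per class (folder names, image size, and epochs are placeholders):

```python
# Sketch of transfer learning on page images with a pre-trained ResNet50 (Keras).
# Assumes page images are sorted into data/train/<class_name>/ folders;
# all hyperparameters below are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=16
)
num_classes = len(train_ds.class_names)
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))

base = ResNet50(weights="imagenet", include_top=False,
                pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained backbone

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```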
You can perform NLP on "entire documents", but it works best on prose text rather than the text of invoices or packing lists. You can look up sentence embedding models (InferSent, Google USE, BERT) that can actually be used to classify full-page text and not just single sentences, although some of them can be computationally expensive.
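For the NLP route, here is a rough sketch that embeds the full page text with a BERT-based sentence-embedding model (the sentence-transformers package is just one option among those listed) and trains a light classifier on top; the model name, texts, and labels are placeholders:

```python
# Sketch of the NLP route: embed the full page text with a BERT-based
# sentence-embedding model and train a light classifier on top.
# Assumes the sentence-transformers package; texts and labels are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

train_texts = ["Invoice No. 1234 ... total amount due ...",
               "Packing list ... quantity ... gross weight ...",
               "Certificate of origin ... hereby certifies ..."]
train_labels = ["invoice", "packing_list", "certificate"]

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

# Classify the extracted text of a new page
new_page_text = "Invoice number 5678, payment due within 30 days ..."
print(clf.predict(encoder.encode([new_page_text])))
```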
Upvotes: 1