Reputation: 5441
I'm trying to use OpenNLP to classify invoices. Based on its description, each invoice will be grouped into one of two classes. I have built a training file with 20K descriptions and tagged each one with the correct class.
The training data looks like (first column is a code, that I use as class, and the second column is the invoice description):
85171231 IPHONE 5S CINZA ESPACIAL 16GB (ME432BZA)
85171231 Galaxy S6 SM-G920I
85171231 motorola - MOTO G5 XT1672
00000000 MOTONETA ITALIKA AT110
00000000 CJ BOX UNIBOX MOLA 138X57X188 VINHO
Using DocumentCategorizer from OpenNLP, I achieved 98.5% accuracy. Then, trying to improve that, I took the wrongly categorized documents and used them to expand the training data.
For instance, on the first run, "MOTONETA ITALIKA AT110" was classified as "85171231". That is understandable, since "MOTONETA ITALIKA AT110" did not appear in the training data. So I taught the classifier explicitly by adding "MOTONETA ITALIKA AT110" tagged as "00000000".
But, running it again, OpenNLP insists on classifying it as "85171231", even though the training data now contains an explicit mapping to "00000000".
So my question is: am I teaching OpenNLP right? How do I improve its accuracy?
The code that I'm using is:
MarkableFileInputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("data.train"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, "100");
params.put(TrainingParameters.CUTOFF_PARAM, "0");
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, new DoccatFactory());
DocumentCategorizer doccat = new DocumentCategorizerME(model);
double[] aProbs = doccat.categorize("MOTONETA ITALIKA AT110".replaceAll("[^A-Za-z0-9 ]", " ").split(" "));
doccat.getBestCategory(aProbs);
Upvotes: 1
Views: 1857
Reputation: 1431
By default, DocumentCategorizer uses a bag of words. That means the sequence of terms is not taken into account.
If any term of MOTONETA ITALIKA AT110 occurs with high frequency in the group 85171231, the classifier will be inclined to choose that group.
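To make that concrete, a bag-of-words model reduces each description to unordered token counts, so all class bias comes from per-class term frequencies. A minimal sketch in plain Java (not OpenNLP internals, just an illustration of the representation):

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Reduce a description to an unordered bag of token counts.
    static Map<String, Integer> bag(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = bag("MOTONETA ITALIKA AT110");
        Map<String, Integer> b = bag("AT110 ITALIKA MOTONETA");
        // Word order is discarded: both descriptions yield the same bag,
        // so only how often each token appears per class matters.
        System.out.println(a.equals(b)); // true
    }
}
```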
You have a few alternatives. The first option is to add more samples containing the terms of MOTONETA ITALIKA AT110 to the group 00000000; the second option would be to change the creation of your model, like this:
int minNgramSize = 2;
int maxNgramSize = 3;
DoccatFactory customFactory = new DoccatFactory(
        new FeatureGenerator[]{
                new BagOfWordsFeatureGenerator(),
                new NGramFeatureGenerator(minNgramSize, maxNgramSize)
        }
);
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);
You can experiment with the feature generators by removing the BagOfWordsFeatureGenerator and by changing the min and max n-gram sizes.
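To see what the n-gram features contribute, here is a small sketch in plain Java (independent of OpenNLP) that enumerates the token 2-grams and 3-grams for the problem description; features like these preserve term order that a plain bag of words loses:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenNgrams {
    // Enumerate token n-grams from minSize to maxSize, joined with spaces.
    static List<String> ngrams(String[] tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "MOTONETA ITALIKA AT110".split(" ");
        // With min=2 and max=3, the multi-word feature
        // "MOTONETA ITALIKA AT110" becomes a signal of its own.
        System.out.println(ngrams(tokens, 2, 3));
        // [MOTONETA ITALIKA, ITALIKA AT110, MOTONETA ITALIKA AT110]
    }
}
```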
Upvotes: 3