Reputation: 5441
I'm trying to use OpenNLP to classify invoices. Based on its description, each invoice will be grouped into one of two classes. I have built a training file with 20K descriptions and tagged each one with the correct class.
The training data looks like (first column is a code, that I use as class, and the second column is the invoice description):
85171231 IPHONE 5S CINZA ESPACIAL 16GB (ME432BZA)
85171231 Galaxy S6 SM-G920I
85171231 motorola - MOTO G5 XT1672
00000000 MOTONETA ITALIKA AT110
00000000 CJ BOX UNIBOX MOLA 138X57X188 VINHO
Using DocumentCategorizer from OpenNLP, I achieved 98.5% accuracy. Then, trying to improve that, I took the wrongly categorized documents and used them to expand the training data.
For instance, on the first run, "MOTONETA ITALIKA AT110" was classified as "85171231". That is understandable, since "MOTONETA ITALIKA AT110" did not appear in the training data. So I taught the classifier explicitly by adding "MOTONETA ITALIKA AT110" tagged as "00000000".
But, running it again, OpenNLP insists on classifying it as "85171231", even though the training data now contains an explicit mapping to "00000000".
So my question is: am I teaching OpenNLP right? How do I improve its accuracy?
The code that I'm using is:
MarkableFileInputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("data.train"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, "100");
params.put(TrainingParameters.CUTOFF_PARAM, "0");
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, new DoccatFactory());
DocumentCategorizer doccat = new DocumentCategorizerME(model);
double[] aProbs = doccat.categorize("MOTONETA ITALIKA AT110".replaceAll("[^A-Za-z0-9 ]", " ").split(" "));
doccat.getBestCategory(aProbs);
Upvotes: 1
Views: 1857
Reputation: 1431
By default, DocumentCategorizer uses a bag of words. That means the sequence of terms is not taken into account.
If any term of MOTONETA ITALIKA AT110 occurs with high frequency in the group 85171231, the classifier will be inclined to choose that group.
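To make that concrete, a bag-of-words model reduces each description to unordered token counts, so all class bias comes from per-class term frequencies. A minimal sketch in plain Java (not OpenNLP internals, just an illustration of the representation):

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Reduce a description to an unordered bag of token counts.
    static Map<String, Integer> bag(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            counts.merge(tok, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = bag("MOTONETA ITALIKA AT110");
        Map<String, Integer> b = bag("AT110 ITALIKA MOTONETA");
        // Word order is discarded: both descriptions yield the same bag,
        // so only how often each token appears per class matters.
        System.out.println(a.equals(b)); // true
    }
}
```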
You have a few alternatives. The first option is to add more samples containing the terms of MOTONETA ITALIKA AT110 to the group 00000000; the second option would be to change the creation of your model, like this:
int minNgramSize = 2;
int maxNgramSize = 3;
DoccatFactory customFactory = new DoccatFactory(
        new FeatureGenerator[]{
                new BagOfWordsFeatureGenerator(),
                new NGramFeatureGenerator(minNgramSize, maxNgramSize)
        }
);
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);
You can experiment with the feature generators by removing the BagOfWordsFeatureGenerator and by changing the min and max n-gram sizes.
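To see what the n-gram features contribute, here is a small sketch in plain Java (independent of OpenNLP) that enumerates the token 2-grams and 3-grams for the problem description; features like these preserve term order that a plain bag of words loses:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenNgrams {
    // Enumerate token n-grams from minSize to maxSize, joined with spaces.
    static List<String> ngrams(String[] tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "MOTONETA ITALIKA AT110".split(" ");
        // With min=2 and max=3, the multi-word feature
        // "MOTONETA ITALIKA AT110" becomes a signal of its own.
        System.out.println(ngrams(tokens, 2, 3));
        // [MOTONETA ITALIKA, ITALIKA AT110, MOTONETA ITALIKA AT110]
    }
}
```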
Upvotes: 3