Reputation: 200
I'm trying to use Apache OpenNLP to identify date entities in a text. I wrote a small Java program that generates tagged training entries from a range of dates, in the following format:
<START:date> {dd/MM/yyyy} <END> .
Each tag is a sentence in the format defined by OpenNLP.
I generated approximately 400k entries and trained the model. After training, I ran TokenNameFinder from the command line to check that everything was OK, but the finder tagged every word I typed as a date entity. For example, when I typed:
today is 17/04/2017
What I got was:
<START:date> today <END> <START:date> is <END> <START:date> 17/04/2017 <END>
I thought the cause might be that the training data contains nothing but dates, so I tried adding a random string before and after each tag, but then training took forever.
Can anyone tell me whether this is a problem with my training dataset, or whether there is something else I should be doing?
Upvotes: 1
Views: 471
Reputation: 1431
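To train a machine-learning Name Finder model, you need a training corpus that is as close as possible to the runtime data. A corpus that contains only tagged dates gives the model no examples of non-date tokens, which is consistent with every word being tagged as a date. If your dates are well behaved and you don't need machine learning, you can use the regex-based RegexNameFinder instead.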
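To illustrate the regex route for well-behaved dd/MM/yyyy dates, here is a minimal sketch that uses only java.util.regex from the JDK. It is not OpenNLP's RegexNameFinder; the class name DateRegexSketch and the method tagDates are made up for this example, and the input is assumed to be whitespace-tokenized. The same pattern could presumably be handed to RegexNameFinder.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: tags dd/MM/yyyy tokens with a plain JDK regex.
final class DateRegexSketch {
    // Day 01-31, month 01-12, four-digit year. Does not reject impossible dates such as 31/02/2017.
    private static final Pattern DATE =
        Pattern.compile("(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/\\d{4}");

    /** Wraps every whitespace-separated token that is a dd/MM/yyyy date in OpenNLP-style tags. */
    static String tagDates(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (DATE.matcher(token).matches()) {
                out.append("<START:date> ").append(token).append(" <END>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: today is <START:date> 17/04/2017 <END>
        System.out.println(tagDates("today is 17/04/2017"));
    }
}
```

Only the date token is tagged here because the pattern must match the whole token, so "today" and "is" pass through unchanged.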
If training is taking forever, either the corpus is too big or it has too few empty lines marking the end of a document. Refer to the Named Entity Recognition documentation for details.
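If you still want to train a model, a generator along these lines might be a starting point. It is only a sketch under assumptions: the sentence templates are invented placeholders (real ones should come from text that resembles your runtime input), the class and method names are hypothetical, and the choice of how many sentences form a "document" is arbitrary. It puts each date inside a surrounding sentence, mixes in sentences without any date, and emits an empty line after every few sentences to mark a document boundary.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator for OpenNLP name finder training lines (one tokenized sentence per line).
final class DateTrainingDataSketch {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Invented templates for illustration; %s is replaced by the tagged date.
    private static final String[] TEMPLATES = {
        "today is %s .",
        "the meeting was moved to %s because of the holiday .",
        "we have no fixed schedule yet .", // no date: shows the model untagged text
        "invoices issued before %s must be paid this month ."
    };

    /** Builds count sentences, adding an empty line after every sentencesPerDocument sentences. */
    static List<String> generate(LocalDate start, int count, int sentencesPerDocument) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String tagged = "<START:date> " + start.plusDays(i).format(FMT) + " <END>";
            // String.format ignores the argument for templates without %s.
            lines.add(String.format(TEMPLATES[i % TEMPLATES.length], tagged));
            if ((i + 1) % sentencesPerDocument == 0) {
                lines.add(""); // empty line = end of document
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : generate(LocalDate.of(2017, 4, 17), 8, 4)) {
            System.out.println(line);
        }
    }
}
```

Redirecting the output to a file gives something you can try as training input; a far smaller corpus than 400k entries may well be enough for a format this regular, though that is a guess rather than a measured result.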
Upvotes: 2