Keaton MacLeod

Reputation: 83

Custom OpenNLP Name Finder recognizes data in training set, but not in testing set

So I finally got OpenNLP incorporated into my project, and I have successfully trained my model on 15k lines of training data, stored it, and can load it when I want to use it to recognize entities in my program!

I am using it to recognize hashtags, so my training data looks something like this:

    ...
    Jim , I know you to be a fighter <START:HASHTAG> #usmarine <END> @ USMC Kira has your strength & amp ; ours @ t1r1u1t1h R love 2 U , Kira & amp ; 
    What has changed that people from your JAMAT are insulting Hindu GODS and GODDESSES . Calling our Religion names ... . 
    Ibtihaj represented the United States of America at the Olympics and brought home a medal , elevating the status of 
    A story point is a metric used in agile project management and development to determine ( or estimate ) the difficul 
    I 'm not shy or quiet , I just do n't find your mind appealing in any way shape or form and I 'm not gon na force a conv 
    <START:HASHTAG> #paradisepapers <END> , Canadian Taxpayers Federation ( CTF ) & amp ; tax reform `` CTF has not uttered even a single shocked-and-a 
    ...

I am finding that the model is unable to recognize any hashtags if it is passed a sentence that is not directly in my training set, such as:

    String paragraph = "Take a shot for #harambe he took one for you!";

It will be unable to recognize the hashtag in this example, even though I checked and there is one instance of #harambe being used within my training data.

However, if I pass it a sentence directly from the training data:

    String nameParagraph = "Idk whats funnier the #harambe or the fact that Im the only one who will see my page https : t.co/2eWjm6mOon ";

It will be able to recognize #harambe by properly identifying it as a HASHTAG.

I want my model to recognize all hashtags, hence I don't just want to feed it more instances of the #harambe hashtag so that it can recognize that SINGLE hashtag.

Any advice for how I can make my model properly identify new entities that are not within the training set? Thanks in advance!

Upvotes: 2

Views: 890

Answers (1)

HowYaDoing

Reputation: 850

I am not sure why you want to statistically model a problem that is deterministic. jbird7 mentioned using a regex; you could also:

Tokenize your text using a WhitespaceTokenizer.

Iterate through the array of tokens and check whether the first character of each token is #.

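    import java.util.ArrayList;
    import java.util.List;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    // create a Tokenizer
    Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(text);

    // use an index-based loop, since each Span needs the token index
    List<Span> spans = new ArrayList<Span>();
    for (int i = 0; i < tokens.length; i++) {
        // a Span covers tokens [i, i+1), i.e. the single token at index i
        if (tokens[i].charAt(0) == '#') spans.add(new Span(i, i + 1, "HashTag"));
    }
    Span[] foundTags = spans.toArray(new Span[spans.size()]);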

At this point, you should have the exact same output as your HashTagNameFinder. Sorry if there are syntax errors. The code should give you an idea of what you want to do.
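For the regex route mentioned above, here is a minimal sketch using only the JDK. The class name `HashtagRegexSketch`, the method `findHashtags`, and the pattern `#\w+` (a `#` followed by one or more word characters) are my own assumptions for illustration; adjust the pattern if your hashtags can contain other characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagRegexSketch {
    // Assumed hashtag shape: '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Returns every substring of text that matches the assumed hashtag pattern.
    static List<String> findHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(text);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String paragraph = "Take a shot for #harambe he took one for you!";
        System.out.println(findHashtags(paragraph)); // prints [#harambe]
    }
}
```

Unlike the token loop, this returns the matched strings rather than token-index spans, so it will also pick up a hashtag that is glued to other text without whitespace.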

Upvotes: 0
