user123

Reputation: 5407

Stanford NER classification with additional classes

The current Stanford NER gives mainly 7 classes: LOCATION, TIME, PERSON, ORGANIZATION, MONEY, PERCENT, and DATE. Additionally, it has been trained on English data, so it cannot classify Indian entities.

Is it possible to train the classifier with additional classes so that it can also identify named entities such as product, month, disease, device, etc.?

Also, it does not classify Indian entities, so support for such non-English classes could also be added, if this is possible.

Is it possible to retrain the classifier and tagger for this additional support?

Upvotes: 0

Views: 677

Answers (3)

Vihari Piratla

Reputation: 9332

The major hassle in training the model on other classes is the training data.
Models require highly accurate training data, such as: I bought a <START:product> MacBook Pro <END> in September and synced it with my <START:device> iPhone <END>. Note that iPhone could be annotated as either device or product.
If you can generate or annotate at least 15,000 sentences with the classes you wish to recognise (which is not easy), you are good to go.
Stanford NER and OpenNLP NER models don't recognise Indian names because the models are trained on Wall Street Journal articles, which are not representative of many names.
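
The inline <START:type> ... <END> markup in the example above is the style OpenNLP's name finder trains from, whereas Stanford NER's CRFClassifier trains from a two-column file with one token and its label per line, separated by a tab, with O for tokens outside any entity. A minimal sketch of converting one format to the other might look like the following. The helper names (to_columns, format_tsv), the whitespace tokenization, and the upper-cased label names are assumptions made for illustration; a real pipeline would use a proper tokenizer.

```python
import re

# Matches inline spans written as <START:type> tokens <END>.
TAG = re.compile(r"<START:([^>]+)>\s*(.*?)\s*<END>")


def to_columns(sentence, outside="O"):
    """Turn one inline-annotated sentence into a list of (token, label) pairs.

    Assumption: tokens are separated by whitespace in the input.
    """
    rows = []
    pos = 0
    for match in TAG.finditer(sentence):
        # Everything before the annotated span is outside any entity.
        rows += [(tok, outside) for tok in sentence[pos:match.start()].split()]
        label = match.group(1).strip().upper()
        rows += [(tok, label) for tok in match.group(2).split()]
        pos = match.end()
    # Remaining text after the last annotated span.
    rows += [(tok, outside) for tok in sentence[pos:].split()]
    return rows


def format_tsv(rows):
    """Render (token, label) pairs as tab-separated lines, one token per line."""
    return "\n".join(f"{tok}\t{label}" for tok, label in rows)


example = (
    "I bought a <START:product> MacBook Pro <END> in September "
    "and synced it with my <START:device> iPhone <END> ."
)
print(format_tsv(to_columns(example)))
```

Each multi-token entity simply repeats its label on every token here; whether to use a scheme such as B-/I- prefixes instead is a separate design choice that this sketch does not address.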

Upvotes: 1

Charlie Greenbacker

Reputation: 718

Also, it does not classify Indian entities, so support for such non-English classes could also be added, if this is possible.

By "Indian," do you mean Hindi? Neither Stanford NER nor Apache OpenNLP provides named entity models for Hindi, but GATE supports basic Hindi named entity recognition: https://gate.ac.uk/sale/tao/splitch15.html#x20-41300015.7

Upvotes: 1

John

Reputation: 557

One possibility for Indian entities is that the Stanford folk are often happy to add outside training data to the classifiers if it is well-formed. For example, two of the three current English models do not recognize "Vihari" in the sentence "Vihari answered my question yesterday." If you compile a list of such sentences and send them to [email protected], they will eventually make their way into a future model.

For other classes such as product, device, etc., you will have to label a large amount of data yourself, which is a rather time-consuming task. Amazon Mechanical Turk might be of service if you can spare the budget.
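
Once such labeled data exists, retraining is mostly a matter of pointing Stanford's CRFClassifier at it through a properties file. The sketch below only generates that file's text and the training command; it does not run Java. The feature flags are modeled on the sample training properties in the Stanford NER FAQ, while the file names (train.tsv, custom-ner-model.ser.gz, custom-ner.prop) and the helper names are placeholders assumed for illustration, so check the flags against the version you have installed.

```python
# Properties modeled on the sample training file in the Stanford NER FAQ.
# The two file names are placeholders, not files shipped with the tool.
PROPS = {
    "trainFile": "train.tsv",                # tab-separated token/label file
    "serializeTo": "custom-ner-model.ser.gz",  # where the trained model is saved
    "map": "word=0,answer=1",                # column 0 is the token, column 1 the label
    "useClassFeature": "true",
    "useWord": "true",
    "useNGrams": "true",
    "noMidNGrams": "true",
    "maxNGramLeng": "6",
    "usePrev": "true",
    "useNext": "true",
    "useSequences": "true",
    "usePrevSequences": "true",
    "maxLeft": "1",
    "useTypeSeqs": "true",
    "useTypeSeqs2": "true",
    "useTypeySequences": "true",
    "wordShape": "chris2useLC",
    "useDisjunctive": "true",
}


def render_props(props):
    """Render a dict as Java-style 'key = value' properties text."""
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"


def train_command(prop_path, jar="stanford-ner.jar"):
    """Build the training command; the jar path is an assumption."""
    return [
        "java", "-cp", jar,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", str(prop_path),
    ]


print(render_props(PROPS))
print(" ".join(train_command("custom-ner.prop")))
```

The labels in the training file define the classes, so adding product or device is a data change rather than a code change.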

Upvotes: 3
