Reputation: 5407
The current Stanford NER models give mainly 7 classes: LOCATION, TIME, PERSON, ORGANIZATION, MONEY, PERCENT and DATE.
Additionally, they have been trained on English data, so they cannot classify Indian entities.
Is it possible to train the classifier with additional classes so that it can also identify named entities such as product, month, disease, device, etc.?
Since it does not classify Indian entities, could support for such non-English entities also be added?
Is it possible to retrain the classifier and tagger for this additional support?
Upvotes: 0
Views: 677
Reputation: 9332
The major hassle in training a model on other classes is obtaining the training data.
Models require highly accurate training data, for example: "I bought a <START:product> MacBook Pro <END> in September and synced it with my <START:device> iPhone <END>."
Observe that iPhone could be annotated as either device or product, so the annotation guidelines have to settle such cases consistently.
If you can generate or annotate at least 15,000 sentences with the classes you wish to recognise (which is not easy), you are good to go.
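As a rough illustration of what such training data looks like on disk, here is a minimal Python sketch that converts a sentence annotated in the `<START:type> ... <END>` style shown above into one token per line with a tab-separated label, which is the column layout Stanford NER's CRF trainer reads. The function names, the whitespace tokenisation, the uppercased labels and the `O` label for non-entity tokens are assumptions made for this example, not part of either toolkit.

```python
import re

# Matches spans such as "<START:product> MacBook Pro <END>".
SPAN_RE = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>")


def to_token_labels(sentence, outside_label="O"):
    """Convert one annotated sentence into a list of (token, label) pairs.

    Assumes the sentence is already tokenised by whitespace (hypothetical
    simplification; a real pipeline would use a proper tokeniser).
    """
    pairs = []
    pos = 0
    for match in SPAN_RE.finditer(sentence):
        # Tokens before the annotated span belong to no entity.
        for token in sentence[pos:match.start()].split():
            pairs.append((token, outside_label))
        # Every token inside the span gets the span's class as its label.
        label = match.group(1).upper()
        for token in match.group(2).split():
            pairs.append((token, label))
        pos = match.end()
    # Remaining tokens after the last span.
    for token in sentence[pos:].split():
        pairs.append((token, outside_label))
    return pairs


def to_tsv(pairs):
    """Render (token, label) pairs as tab-separated lines, one token per line."""
    return "\n".join(f"{token}\t{label}" for token, label in pairs)


example = (
    "I bought a <START:product> MacBook Pro <END> in September "
    "and synced it with my <START:device> iPhone <END> ."
)
pairs = to_token_labels(example)
print(to_tsv(pairs))
```

Each annotated sentence would be converted this way and the results concatenated into a single training file. Whether iPhone ends up labelled DEVICE or PRODUCT is decided entirely by the annotation, which is why consistent labelling matters.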
Stanford NER and OpenNLP NER models don't recognise Indian names because the models are trained on Wall Street Journal articles, which are not representative of many names.
Upvotes: 1
Reputation: 718
"Also it does not classify Indian entities, so support for such non-english classes too can also be added if this is possible."
By "Indian," do you mean Hindi? Neither Stanford NER nor Apache OpenNLP provides named entity models for Hindi, but GATE has support for basic Hindi named entity recognition: https://gate.ac.uk/sale/tao/splitch15.html#x20-41300015.7
Upvotes: 1
Reputation: 557
One possibility for Indian entities is that the Stanford folk are often happy to add outside training data to the classifiers if it is well formed. For example, two of the three current English models do not recognize "Vihari" in the sentence "Vihari answered my question yesterday." If you compile a list of such sentences and send them to [email protected], they will eventually make their way into a future model.
For other classes such as product, device, etc., you will have to label a large amount of data yourself, which is a rather time-consuming task. Amazon Mechanical Turk might be of service if you can spare the budget.
Upvotes: 3