Reputation: 37
I am currently working on a custom named-entity recognizer to recognize 4 types of entity: car, equipment, date, issue.
To do so, I use rasa_nlu with its ner_crf component, which is built on sklearn-crfsuite. However, before tagging hundreds of sentences, I asked myself two questions to which I haven't found answers:
1. Is it better to tag "On 31st Jan." or "31st Jan." as a date?
2. Is it worth tagging "rubbers" as equipment, considering that it comes at the end of a long description and that most of the time I just want to get the first entities in the text?
I took a look at how CRF works. From what I understood, the probability that a word w is classified as an entity e1 depends on whether this word has already been tagged e1 in other documents, but also on whether it follows a word w2 tagged e2 and on how often words tagged e1 follow words tagged e2.
So the question is: is it better to prefer entity tagging sequences or entity tagging content? Is it more useful to say that a date comes after "on", or that it includes "on", in order to detect this date?
Thank you in advance
Upvotes: 0
Views: 448
Reputation: 15593
You seem to be confused about how NER works. You're trying to train a model so you can write functions that work like this:
sentence = "On Jan 31st. I went to Neptune, and then on Feb 3rd I went to Pluto."
get_dates(sentence) # => ['Jan 31st', 'Feb 3rd']
get_places(sentence) # => ['Neptune', 'Pluto']
In order to train the model, you tag the words you want in the function output. So don't tag the context around a word. You can think of the tags as examples of what your function would output if it were working correctly.
Is it better to tag "On 31st Jan." or "31st Jan." as a date?
You don't want "on" so don't tag it. "On" isn't part of a date.
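To make that concrete, here is a rough sketch of what one tagged training example could look like. It assumes the legacy rasa_nlu JSON layout (`rasa_nlu_data` / `common_examples` with `start`, `end`, `value`, `entity`), so check the docs for your version. The helper `make_example` and the intent name `report_trip` are made up for illustration. Note that the spans cover only the dates, not the "On" in front of them.

```python
import json

def make_example(text, intent, spans):
    """Build one training example.

    `spans` is a list of (substring, entity_type) pairs, given in the
    order in which they appear in `text`. Hypothetical helper, not part
    of rasa_nlu.
    """
    entities = []
    cursor = 0
    for value, entity in spans:
        # str.index raises ValueError if the substring is not found
        start = text.index(value, cursor)
        end = start + len(value)
        entities.append(
            {"start": start, "end": end, "value": value, "entity": entity}
        )
        cursor = end
    return {"text": text, "intent": intent, "entities": entities}

text = "On Jan 31st. I went to Neptune, and then on Feb 3rd I went to Pluto."
example = make_example(
    text,
    "report_trip",  # hypothetical intent name
    [("Jan 31st", "date"), ("Feb 3rd", "date")],  # no "On" in the spans
)

# Assumed legacy rasa_nlu JSON training-data layout
training_data = {"rasa_nlu_data": {"common_examples": [example]}}
print(json.dumps(training_data, indent=2))
```

The character offsets are computed from the text instead of typed by hand, which avoids off-by-one mistakes when you tag hundreds of sentences.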
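Is it better to prefer entity tagging sequences or entity tagging content?
You tag the content so that the model can learn the sequences: surrounding words such as "on" stay untagged, but the CRF still sees them as context when it labels the next word. Look at the training data for generic NER models for examples.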
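A small sketch may help show why leaving "on" untagged does not throw the information away. It loosely follows the per-token feature dicts used in sklearn-crfsuite style tutorials. The function `token_features`, the feature names, and the BIO labels are assumptions for illustration, not what ner_crf does internally.

```python
def token_features(tokens, i):
    """Features for tokens[i], including its neighbours (hypothetical helper)."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.has_digit": any(c.isdigit() for c in word),
    }
    if i > 0:
        # The previous word is a feature even though it carries no entity tag
        feats["-1:word.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats

tokens = ["On", "Jan", "31st", "I", "went", "to", "Neptune"]
# Assumed BIO scheme: "O" means "not part of any entity"
labels = ["O", "B-date", "I-date", "O", "O", "O", "B-place"]

X = [token_features(tokens, i) for i in range(len(tokens))]
for tok, lab, feats in zip(tokens, labels, X):
    print(tok, lab, feats.get("-1:word.lower"))
```

Here "On" is labelled `O`, yet the features for "Jan" contain `"-1:word.lower": "on"`, so a CRF trained on such data can pick up that dates often follow "on" without "on" ever being part of a date.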
Is it worth tagging "rubbers" as equipment, considering that it comes at the end of a long description and that most of the time I just want to get the first entities in the text?
This depends on your application. If you gave your training sentence to your program and asked for a list of equipment, should "rubbers" be in that list? If it should, then tag it.
Upvotes: 1