Antoine Deleuze

Reputation: 37

What is the best way to tag entities for NER using a CRF?

I am currently working on a custom named-entity recognizer to recognize four types of entity: car, equipment, date, issue.

To do so, I use rasa_nlu with NER_crf from sklearn-crfsuite. However, before tagging hundreds of sentences, I asked myself two questions and I haven't found the answers:

  1. Take for example "On 31st Jan., the wheels of AA-075-ZP exhibited an increase in friction". Is it better to tag "On 31st Jan." or "31st Jan." as a date? Same question for "the wheels" versus "wheels" as equipment.

I took a look at how a CRF works. From what I understood, the probability that a word w is classified as an entity e1 depends on whether this word has already been tagged e1 in other documents, but also on whether it follows a word w2 tagged e2 and on how often words tagged e1 follow words tagged e2.

So the question is: is it better to favor tagging sequences or tagging content? To detect a date, is it more useful to say that a date comes after "on", or that it includes "on"?

  2. My samples are often a description of the issue, such as: "On 31st Jan., the wheels of AA-075-ZP exhibited an increase in friction. This was caused by ... and .... on ... No more impact on the car, the four rubbers have been replaced". Is it useful to tag "rubbers" as equipment, considering that it comes at the end of a long description and that most of the time I just want the first entities in the text? Is it worth increasing the number of occurrences of "rubber" (so that it has a better chance of being tagged as equipment) at the cost of also giving weight to the pattern "equipment coming after a lot of words"?

Thank you in advance

Upvotes: 0

Views: 448

Answers (1)

polm23

Reputation: 15593

You seem to be confused about how NER works. You're trying to train a model so you can write functions that work like this:

sentence = "On Jan 31st I went to Neptune, and then on Feb 3rd I went to Pluto."
get_dates(sentence) # => ['Jan 31st', 'Feb 3rd']
get_places(sentence) # => ['Neptune', 'Pluto']

In order to train the model, you tag the words you want in the function output. So don't tag the context around a word. You can think of the tags as examples of the output from your function if it's working correctly.
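As a rough sketch of what "tag only the words you want" can look like at the token level, here is the question's sentence labelled with the common BIO scheme (B- for the first token of an entity, I- for the following ones, O for everything else). The BIO scheme, the `bio_tags` helper, and the hand-made tokenization are assumptions for illustration, not something rasa_nlu requires you to write yourself.

```python
def bio_tags(tokens, entities):
    """Label tokens B-/I-<type> inside an entity and O everywhere else.

    `entities` maps an entity type to a list of token sequences to tag.
    (Hypothetical helper, for illustration only.)
    """
    tags = ["O"] * len(tokens)
    for label, phrases in entities.items():
        for phrase in phrases:
            n = len(phrase)
            # Slide over the sentence and mark every exact match of the phrase.
            for start in range(len(tokens) - n + 1):
                if tokens[start:start + n] == phrase:
                    tags[start] = "B-" + label
                    for i in range(start + 1, start + n):
                        tags[i] = "I-" + label
    return tags


tokens = ["On", "31st", "Jan.", ",", "the", "wheels", "of", "AA-075-ZP"]
tags = bio_tags(
    tokens,
    {
        "date": [["31st", "Jan."]],      # "On" is context, so it is not listed
        "equipment": [["wheels"]],       # "the" is context as well
        "car": [["AA-075-ZP"]],
    },
)
print(tags)
# ['O', 'B-date', 'I-date', 'O', 'O', 'B-equipment', 'O', 'B-car']
```

"On" and "the" stay labelled O: they remain in the sentence as context, but they are not part of what you want back.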

Is it better to tag "On 31st Jan." or "31st Jan." as a date?

You don't want "on" so don't tag it. "On" isn't part of a date.

Is it better to favor tagging sequences or tagging content?

You tag the content so that the model can learn the sequences. Look at training data for generic NER models.
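To illustrate how the model can still use "on" without it being tagged, here is a minimal sketch of per-token features of the kind usually fed to a CRF: each token sees its neighbours, so "the previous word is 'on'" becomes evidence for a date label. The feature names and the `token_features` helper are made up for this example and are not the features rasa_nlu's CRF component actually extracts.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i], including its neighbours.

    (Hypothetical feature set, for illustration only.)
    """
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.has_digit": any(c.isdigit() for c in word),
    }
    if i > 0:
        # The untagged context word is still visible to the model here.
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


tokens = ["On", "31st", "Jan."]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[1])
# {'word.lower': '31st', 'word.has_digit': True, 'prev.lower': 'on', 'next.lower': 'jan.'}
```

A list of such dicts per sentence, paired with one label per token, is the kind of input a library like sklearn-crfsuite is trained on.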

Is it useful to tag "rubbers" as equipment, considering that it comes at the end of a long description and that most of the time I just want the first entities in the text?

This depends on your application. If you gave your training sentence to your program and asked for a list of equipment, should "rubbers" be in that list? If it should, then tag it.
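One way to make that check concrete, assuming token-level BIO-style tags (B-/I- prefixes for entity tokens, O otherwise): a hypothetical `get_entities` helper that collects every span of a given type. The tokens and tags below are written by hand to show the desired output, not produced by a trained model.

```python
def get_entities(tokens, tags, label):
    """Return the text of every span tagged B-<label> / I-<label>.

    (Hypothetical helper, for illustration only.)
    """
    found, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + label:
            if current:  # close a span that is directly followed by a new one
                found.append(" ".join(current))
            current = [token]
        elif tag == "I-" + label and current:
            current.append(token)
        else:
            if current:
                found.append(" ".join(current))
            current = []
    if current:  # span running to the end of the sentence
        found.append(" ".join(current))
    return found


tokens = ["the", "wheels", "of", "AA-075-ZP", "...",
          "the", "four", "rubbers", "have", "been", "replaced"]
tags = ["O", "B-equipment", "O", "B-car", "O",
        "O", "O", "B-equipment", "O", "O", "O"]

equipment = get_entities(tokens, tags, "equipment")
print(equipment)
# ['wheels', 'rubbers']
```

If `['wheels', 'rubbers']` is the list you would want back for this sentence, tag both in the training data; if you would only want `['wheels']`, leave "rubbers" untagged.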

Upvotes: 1
