Craig Foster

Reputation: 109

Stanford NLP - NER - Train NER with names that have multiple tokens

I have recently started looking at Stanford NLP (using the C# port). I plan to use NER to identify store location names and product names. To do this I will need to train the tagger, which I am in the process of doing.

However, some locations, for example "Kings Cross", should only be considered a location when both tokens appear together; i.e., I wouldn't want "Kings" tagged as a location when it is used by itself in a sentence.

So my question is: is there a recommended way to deal with locations/names that contain a space (both in my training files and in code)?

Thank you for any help.

Upvotes: 1

Views: 778

Answers (2)

polm23

Reputation: 15593

The standard way of dealing with this in NER is using IOB tags or some variation. Labels using IOB could look like this:

I        O
went     O
to       O
Kings    B-PLACE
Cross    I-PLACE

Where O means "no label", B-XXX means "beginning of XXX", and I-XXX means "in XXX".
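As a rough illustration of how such labels could be produced for a training file, here is a minimal Python sketch. The function name `iob_encode` and the phrase dictionary are made up for this example, and the tab-separated token/label output is an assumption about the training format, so check it against what your tagger expects.

```python
def iob_encode(tokens, phrases):
    """Assign IOB tags to tokens.

    phrases maps a tuple of tokens (hypothetical gazetteer) to a label,
    e.g. {("Kings", "Cross"): "PLACE"}.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest known phrase that starts at position i.
        for phrase, label in phrases.items():
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                if match is None or n > len(match[0]):
                    match = (phrase, label)
        if match is None:
            i += 1
            continue
        phrase, label = match
        tags[i] = "B-" + label              # first token of the entity
        for j in range(i + 1, i + len(phrase)):
            tags[j] = "I-" + label          # remaining tokens of the entity
        i += len(phrase)
    return tags


tokens = "I went to Kings Cross".split()
tags = iob_encode(tokens, {("Kings", "Cross"): "PLACE"})
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because only the full phrase is in the dictionary, a sentence containing "Kings" on its own gets O for every token.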

The tagger learns that a multi-token entity starts with a B tag and may continue with I tags; to the model it is just another tag transition. To collect multi-token entities from the tagger output, walk through the tokens, start a new entry at each B tag, and append the following I-tagged tokens to it.
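The walk over tagger output could be sketched in Python like this. The function name `collect_entities` is hypothetical, and the input is assumed to be two parallel lists of tokens and IOB tags; how you get those out of the tagger depends on the port you use.

```python
def collect_entities(tokens, tags):
    """Group B-/I- tagged tokens into (label, text) entities."""
    entities = []
    current = None  # (label, [tokens]) for the entity being built, if any
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B tag always starts a new entity; close any open one first.
            if current is not None:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I tag with the same label continues the open entity.
            current[1].append(token)
        else:
            # O tag, or an I tag with nothing to continue: close the open entity.
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]


tokens = "I went to Kings Cross".split()
tags = ["O", "O", "O", "B-PLACE", "I-PLACE"]
print(collect_entities(tokens, tags))
```

This sketch silently drops an I tag that does not follow a matching B tag; whether to drop it or treat it as the start of an entity is a design choice.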

Upvotes: 1

StanfordNLPHelp

Reputation: 8739

You have two options: train a statistical tagger and hope it does the right thing, or use the regexner annotator and supply it with a list of known named entities. For instance, if that list includes an entry for Kings Cross, the entry will match only when the full phrase Kings Cross appears.

More documentation for regexner is available here:

https://nlp.stanford.edu/software/regexner.html
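As a hedged sketch of what such a list might look like, the Python below builds mapping lines on the assumption that regexner reads one entry per line, with space-separated per-token patterns, a tab, and then the entity label (check the page above for the exact format and for how to point the annotator at the file). The helper name `mapping_line` and the example entries are made up for illustration.

```python
import re


def mapping_line(phrase, label):
    """Build one mapping line: per-token patterns, a tab, then the label."""
    # Each token is assumed to be treated as a regular expression,
    # so escape characters such as "." to match them literally.
    patterns = [re.escape(token) for token in phrase.split()]
    return " ".join(patterns) + "\t" + label


# Hypothetical known entities for illustration.
known_entities = [("Kings Cross", "LOCATION"), ("Covent Garden", "LOCATION")]
lines = [mapping_line(phrase, label) for phrase, label in known_entities]
print("\n".join(lines))
```

The resulting lines can be written to a text file and supplied to the annotator as its list of known entities.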

Upvotes: 1
