Working with named-entity datasets

Question

I am working on a classification task where we are building models that detect the type of entity present in a span of text (i.e., an annotation). These models can be built with a dataset where each instance is represented by three independent text variables:

pre-context: document text before the annotation.
annotation: span of the document where we want to detect the entity type. If no entity exists, all the entity type columns (isOrganization, isPerson, isTime) are marked 0.
post-context: document text after the annotation.

Data Set 1: Entity type classification in spans of text.

preContext  | annotation       | postContext | isOrganization | isPerson | isTime 
....        | on July 12, 2011 | ....        | 0              | 0        | 1 
With over 8 | million invested | in Chrysler | 0              | 0        | 0

Data Set 2: Boundary detection - "start-of-entity"

In the first example, the transition between preContext and text marks the start of an organization-type entity. In the second example, there is no entity present at the transition between preContext and text, therefore all of the dependent variable columns are marked as zero.

preContext          | text                                                                                        | isStartOfOrganization | isStartOfPerson | isStartOfTime
Private equity firm | Westbridge Capital could exit part or all of its stake in Hyderabad-based technology firm. | 1                     | 0               | 0

I have been using basic NLP techniques such as TF-IDF, n-grams, tokenizers, stemmers, POS taggers, and stop lists for the above problem. What I really want now is to experiment with techniques other than the ones I have already tried, but I have not been able to find any suitable ones. It seems the only way to make significant further gains is to start thinking outside the box. Could you please suggest some new techniques for solving the above problems?

Ben Allison · Accepted Answer

Named Entity Recognition is one of the canonical sequence labelling tasks. The way this is normally done, which is close to your setup but slightly different, is to attach a tag to each word in your sentence. Something like the following:

With/NONE over/NONE eight/NONE million/NONE invested/NONE in/NONE Chrysler/BEGIN-COMPANY

On/NONE Tuesday/NONE, Mr/BEGIN-PERSON X/PERSON, CEO/NONE of/NONE Technology/BEGIN-COMPANY Products/COMPANY Inc/COMPANY, said/NONE ...

I believe having a separate start tag is common (BEGIN-COMPANY vs COMPANY) as it helps to learn the transition between NONE and a category. You could try both though.
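To make the tagging scheme concrete, here is a minimal sketch that turns a tokenized sentence plus entity spans into the word/TAG form shown above. The function name `tag_tokens` and the (start, end, label) span format are my own assumptions for illustration, not part of any particular toolkit.

```python
def tag_tokens(tokens, entities):
    """Assign one tag per token.

    tokens:   list of word strings.
    entities: list of (start, end, label) tuples using token indices,
              with `end` exclusive. This span format is an assumption.
    """
    tags = ["NONE"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "BEGIN-" + label   # first token of the entity
        for i in range(start + 1, end):
            tags[i] = label              # remaining tokens of the entity
    return tags


tokens = "With over eight million invested in Chrysler".split()
tags = tag_tokens(tokens, [(6, 7, "COMPANY")])
tagged = " ".join(f"{word}/{tag}" for word, tag in zip(tokens, tags))
print(tagged)
```

Dropping the separate start tag only requires replacing `"BEGIN-" + label` with `label`, so it is cheap to try both variants.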

You don't then want to approach this as a bunch of independent decisions (classification), as the decisions you make are interdependent. Instead, use a specific sequence-labelling model. The most general and easiest to pick up, if you have access to a toolkit (many are available), is a Conditional Random Field, as it lets you define arbitrary feature functions for each word without worrying about distributional assumptions. Common features are word-id, previous tag id, whether the word is capitalized, whether it's present in various lists of proper nouns, etc. You can then learn the model from labelled data and apply it to new texts.
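As a rough illustration of the per-word feature functions mentioned above, the sketch below builds one feature dictionary per token. The feature names, the `proper_nouns` set, and the dictionary-per-token layout are assumptions on my part. Many CRF toolkits accept something similar, but check the input format of whichever one you use. The previous-tag feature is left out here on the assumption that a linear-chain CRF toolkit models tag transitions itself.

```python
def word_features(tokens, i, proper_nouns):
    """Build a feature dict for the token at position i.

    proper_nouns is a hypothetical set of known proper nouns (a gazetteer).
    """
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),                    # word identity
        "word.is_capitalized": word[:1].isupper(),     # capitalization cue
        "word.in_proper_noun_list": word in proper_nouns,
    }
    if i > 0:
        feats["prev_word.lower"] = tokens[i - 1].lower()
    else:
        feats["is_first_word"] = True                  # sentence start marker
    return feats


def sentence_features(tokens, proper_nouns):
    """Return one feature dict per token in the sentence."""
    return [word_features(tokens, i, proper_nouns) for i in range(len(tokens))]


tokens = ["invested", "in", "Chrysler"]
features = sentence_features(tokens, proper_nouns={"Chrysler"})
for token, feats in zip(tokens, features):
    print(token, feats)
```

The resulting list of dictionaries, paired with the per-token tags, is the kind of labelled training data a CRF would be learned from.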
