Navaneethan Santhanam

Reputation: 1837

Named Entity Recognition using Vowpal Wabbit appears to memorise training data

I'm using Vowpal Wabbit's Python API to train Named Entity Recognition classifiers that detect names of people, organisations, and locations in short sentences. I've put together an IPython Notebook with details on the data, how the models are trained, and the entities identified in evaluation sentences. The training data comes from the ATIS and CoNLL 2003 datasets.

The setup of my Vowpal Wabbit SearchTask class (based on this tutorial):

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)

        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )

    def _run(self, sentence):
        output = []
        for n in range(len(sentence)):
            pos, word = sentence[n]
            # One example per token, with the word in namespace 'w'
            with self.vw.example({'w': [word]}) as ex:
                # Condition on the two previous predictions (tags n and n-1)
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos,
                                        condition=[(n, 'p'), (n-1, 'q')])
                output.append(pred)
        return output

Model training:

num_labels = 3    # 'B'eginning entity, 'I'nside entity, 'O'ther
vw = pyvw.vw(search=num_labels, search_task='hook', ring_size=1024)

sequenceLabeler = vw.init_search_task(SequenceLabeler)
sequenceLabeler.learn(training_set)
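For reference, `_run` above unpacks each token as `pos, word = sentence[n]`, so `training_set` is a list of sentences, each a list of `(label, word)` pairs. A minimal illustration (the sentences and the 1-indexed label IDs here are my own; they are not the actual ATIS/CoNLL data):

```python
# Labels: 1 = 'B' (beginning of entity), 2 = 'I' (inside entity), 3 = 'O' (other)
B, I, O = 1, 2, 3

# Each sentence is a list of (label, word) pairs, matching the
# `pos, word = sentence[n]` unpacking in SequenceLabeler._run.
training_set = [
    [(B, 'new'), (I, 'york'), (O, 'to'), (B, 'las'), (I, 'vegas'),
     (O, 'on'), (O, 'sunday'), (O, 'afternoon')],
    [(B, 'boston'), (O, 'to'), (B, 'denver'), (O, 'on'), (O, 'monday'),
     (O, 'morning')],
]
```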

The model performs well on named entities (exact string matches) present in the training data, but generalises poorly to new examples with the same structure. That is, the classifiers identify entities seen in training sentences, but when I change ONLY the names, they fail.

sample_sentences = ['new york to las vegas on sunday afternoon', 
                    'chennai to mumbai on sunday afternoon',
                    'lima to ascuncion on sunday afternoon']

The output when I run the classifier:

new york to las vegas on sunday afternoon
locations - ['new york', 'las vegas']

chennai to mumbai on sunday afternoon
locations - []

lima to ascuncion on sunday afternoon
locations - []

This suggests that even though the sentence structure remains the same ('a to b on sunday afternoon'), the model cannot identify the new locations, perhaps because it has memorised the training examples.

Similar results hold for the organisation and person classifiers. Both can be found on my GitHub.

My questions are:

  1. What am I doing incorrectly here?
  2. Are there other model parameters I can vary? Or can I make better use of existing ones such as ring_size and search_task?
  3. Any suggestions to improve the models' generalisability?

Upvotes: 2

Views: 617

Answers (1)

Martin Popel

Reputation: 2670

  1. You use no gazetteers and no orthographic features (e.g. --spelling or --affix), and your data is all lowercased, so the only features that can help are unigram and bigram identities. It is no surprise that you overfit the training data. In theory you could augment your training data with artificial named entities that follow the patterns (x to y on sunday), but if that helped, it would be easier to build a rule-based classifier.
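Since the question builds its examples as Python feature dicts rather than vw input lines, orthographic features could also be added by hand. A minimal sketch (the namespace names `p`, `s`, `o` and the helper function are my own invention, roughly mimicking what --affix and --spelling provide):

```python
def word_features(word, affix_len=3):
    """Build a feature dict with the word identity plus simple
    orthographic features (prefix, suffix, word shape)."""
    return {
        'w': [word],                        # unigram identity, as in the question
        'p': ['pre=' + word[:affix_len]],   # prefix, cf. vw's --affix
        's': ['suf=' + word[-affix_len:]],  # suffix
        'o': ['shape=' + ('X' if word[:1].isupper() else 'x')
              + ('d' if any(c.isdigit() for c in word) else '')],
    }

# This dict would replace {'w': [word]} in SequenceLabeler._run
print(word_features('chennai'))
```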

  2. There are many parameters, e.g. -l (learning rate) and --passes. See the tutorial and a list of options. Note that ring_size does not influence the prediction quality, you just need to set it high enough that you don't get any warnings (i.e. higher than the longest sequence).
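As a concrete illustration (assuming pyvw forwards keyword arguments as command-line flags, the way search and ring_size are passed in the question), a small sweep over learning rate and passes could be set up in plain Python; the values here are arbitrary starting points, not recommendations:

```python
# Hypothetical sweep over vw's -l (learning rate) and --passes flags.
param_grid = [
    {'l': lr, 'passes': n}
    for lr in (0.1, 0.5, 1.0)
    for n in (1, 5, 10)
]

# Each settings dict would then be splatted into the constructor, e.g.:
#   vw = pyvw.vw(search=3, search_task='hook', ring_size=1024, **params)
# followed by retraining and evaluation on held-out sentences.
```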

  3. See point 1.

Upvotes: 4
