Navaneethan Santhanam

Reputation: 1837

Named Entity Recognition using Vowpal Wabbit appears to memorise training data

I'm using Vowpal Wabbit's Python API to train Named Entity Recognition classifiers that detect names of people, organisations, and locations in short sentences. I've put together an IPython Notebook with details on the data, how the models are trained, and the entities identified in evaluation sentences. The training data comes from the ATIS and CoNLL 2003 datasets.

The setup of my Vowpal Wabbit SearchTask class (based on this tutorial):

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)

        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )

    def _run(self, sentence):
        output = []
        for n in range(len(sentence)):
            pos, word = sentence[n]
            # One example per token, with the word in namespace 'w'
            with self.vw.example({'w': [word]}) as ex:
                # Condition on the two previous predictions (tags n and n-1)
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos,
                                        condition=[(n, 'p'), (n-1, 'q')])
                output.append(pred)
        return output

Model training:

num_labels = 3    # 'B'eginning entity, 'I'nside entity, 'O'ther
vw = pyvw.vw(search=num_labels, search_task='hook', ring_size=1024)

sequenceLabeler = vw.init_search_task(SequenceLabeler)
sequenceLabeler.learn(training_set)
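For reference, `_run` above unpacks each token as `pos, word = sentence[n]`, so `training_set` is a list of sentences, each a list of `(label, word)` pairs. A minimal illustration (the sentences and the 1-indexed label IDs here are my own; they are not the actual ATIS/CoNLL data):

```python
# Labels: 1 = 'B' (beginning of entity), 2 = 'I' (inside entity), 3 = 'O' (other)
B, I, O = 1, 2, 3

# Each sentence is a list of (label, word) pairs, matching the
# `pos, word = sentence[n]` unpacking in SequenceLabeler._run.
training_set = [
    [(B, 'new'), (I, 'york'), (O, 'to'), (B, 'las'), (I, 'vegas'),
     (O, 'on'), (O, 'sunday'), (O, 'afternoon')],
    [(B, 'boston'), (O, 'to'), (B, 'denver'), (O, 'on'), (O, 'monday'),
     (O, 'morning')],
]
```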

The model performs well on named entities (exact string matches) present in the training data, but generalises poorly to new examples with the same structure. That is, the classifiers identify entities seen in training sentences, but when I change ONLY the names, they fail.

sample_sentences = ['new york to las vegas on sunday afternoon', 
                    'chennai to mumbai on sunday afternoon',
                    'lima to ascuncion on sunday afternoon']

The output when I run the classifier:

new york to las vegas on sunday afternoon
locations - ['new york', 'las vegas']

chennai to mumbai on sunday afternoon
locations - []

lima to ascuncion on sunday afternoon
locations - []

This suggests that even though the sentence structure remains the same ('a to b on sunday afternoon'), the model cannot identify the new locations, perhaps because it has memorised the training examples.

Similar results hold for the organisation and person classifiers. Both can be found on my GitHub.

My questions are:

  1. What am I doing incorrectly here?
  2. Are there other model parameters I can vary? Or can I make better use of existing ones such as ring_size and search_task?
  3. Any suggestions to improve the models' generalisability?

Upvotes: 2

Views: 617

Answers (1)

Martin Popel

Reputation: 2670

  1. You use no gazetteers and no orthographic features (e.g. --spelling or --affix), and your data is all lowercased, so the only features that can help are unigram and bigram identities. It is no surprise that you overfit the training data. In theory you could augment your training data with artificial named entities that follow the patterns (x to y on sunday), but if that helped, it would be easier to build a rule-based classifier.
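Since the question builds its examples as Python feature dicts rather than vw input lines, orthographic features could also be added by hand. A minimal sketch (the namespace names `p`, `s`, `o` and the helper function are my own invention, roughly mimicking what --affix and --spelling provide):

```python
def word_features(word, affix_len=3):
    """Build a feature dict with the word identity plus simple
    orthographic features (prefix, suffix, word shape)."""
    return {
        'w': [word],                        # unigram identity, as in the question
        'p': ['pre=' + word[:affix_len]],   # prefix, cf. vw's --affix
        's': ['suf=' + word[-affix_len:]],  # suffix
        'o': ['shape=' + ('X' if word[:1].isupper() else 'x')
              + ('d' if any(c.isdigit() for c in word) else '')],
    }

# This dict would replace {'w': [word]} in SequenceLabeler._run
print(word_features('chennai'))
```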

  2. There are many parameters, e.g. -l (learning rate) and --passes. See the tutorial and a list of options. Note that ring_size does not influence the prediction quality, you just need to set it high enough that you don't get any warnings (i.e. higher than the longest sequence).
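As a concrete illustration (assuming pyvw forwards keyword arguments as command-line flags, the way search and ring_size are passed in the question), a small sweep over learning rate and passes could be set up in plain Python; the values here are arbitrary starting points, not recommendations:

```python
# Hypothetical sweep over vw's -l (learning rate) and --passes flags.
param_grid = [
    {'l': lr, 'passes': n}
    for lr in (0.1, 0.5, 1.0)
    for n in (1, 5, 10)
]

# Each settings dict would then be splatted into the constructor, e.g.:
#   vw = pyvw.vw(search=3, search_task='hook', ring_size=1024, **params)
# followed by retraining and evaluation on held-out sentences.
```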

  3. See point 1.

Upvotes: 4
