Reputation: 1837
I'm using Vowpal Wabbit's python API to train Named Entity Recognition classifiers to detect names of people, organisations, and locations from short sentences. I've put together an IPython Notebook with details on the data, how models are trained, and entities identified in evaluation sentences. Training data comes from the ATIS and CONLL 2003 datasets.
The setup of my Vowpal Wabbit SearchTask class (based on this tutorial):
class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

    def _run(self, sentence):
        output = []
        for n in range(len(sentence)):
            pos, word = sentence[n]
            # one unigram feature per token; each prediction is conditioned
            # on the previous two predicted tags
            with self.vw.example({'w': [word]}) as ex:
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos,
                                        condition=[(n, 'p'), (n-1, 'q')])
                output.append(pred)
        return output
Model training:
num_labels = 3  # 'B'eginning entity, 'I'nside entity, 'O'ther
vw = pyvw.vw(search=num_labels, search_task='hook', ring_size=1024)
sequenceLabeler = vw.init_search_task(SequenceLabeler)
sequenceLabeler.learn(training_set)
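For context, each sentence passed to learn() is a list of (label, word) pairs, which is exactly what _run unpacks. A minimal sketch of preparing one sentence (the helper name and the 1/2/3 label mapping are my assumptions, not taken from the notebook):

```python
# Assumed mapping from BIO tags to the integer actions VW search expects.
BIO_TO_ACTION = {'B': 1, 'I': 2, 'O': 3}

def to_vw_sentence(words, tags):
    """Pair each word with its integer label: the (label, word) format _run unpacks."""
    return [(BIO_TO_ACTION[t], w) for w, t in zip(words, tags)]

sentence = to_vw_sentence(['new', 'york', 'to', 'las', 'vegas'],
                          ['B',   'I',   'O',  'B',   'I'])
# sentence == [(1, 'new'), (2, 'york'), (3, 'to'), (1, 'las'), (2, 'vegas')]
```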
The model performs well on named entities (exact string matches) present in the training data, but generalises poorly to unseen examples with the same structure. That is, the classifiers identify entities that appear in training sentences, but when I change ONLY the names, they do poorly.
sample_sentences = ['new york to las vegas on sunday afternoon',
'chennai to mumbai on sunday afternoon',
'lima to ascuncion on sunday afternoon']
The output when running the classifier on these:
new york to las vegas on sunday afternoon
locations - ['new york', 'las vegas']
chennai to mumbai on sunday afternoon
locations - []
lima to ascuncion on sunday afternoon
locations - []
This indicates that even though the sentence structure remains the same ('a to b on sunday afternoon'), the model cannot identify the new locations, perhaps because it has memorised the training examples?
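For reference, the location strings shown above come from decoding the predicted B/I/O tags back into spans. A sketch of that decoding step (a hypothetical helper, not from the notebook; assumes the 1='B', 2='I', 3='O' encoding):

```python
def extract_entities(words, actions, begin=1, inside=2):
    """Group consecutive B/I predictions into entity strings."""
    entities, current = [], []
    for word, a in zip(words, actions):
        if a == begin:                 # start of a new entity
            if current:
                entities.append(' '.join(current))
            current = [word]
        elif a == inside and current:  # continue the open entity
            current.append(word)
        else:                          # 'O' (or a stray 'I') closes any open entity
            if current:
                entities.append(' '.join(current))
            current = []
    if current:
        entities.append(' '.join(current))
    return entities

words = 'new york to las vegas on sunday afternoon'.split()
print(extract_entities(words, [1, 2, 3, 1, 2, 3, 3, 3]))
# ['new york', 'las vegas']
```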
Similar results hold for the organisation and person classifiers. These can be found in my GitHub.
My questions are: which parameters can I tune to improve generalisation, and what do ring_size and search_task do?

Upvotes: 2
Views: 617
Reputation: 2670
You use no gazetteers and no orthographic features (e.g. --spelling or --affix), and your data is all lowercased, so the only features that can help are unigram and bigram identities. It is no surprise you overfit the training data. Theoretically, you could augment your training data with artificial named entities that follow the patterns (x to y on sunday), but if that helped, it would be easier to build a rule-based classifier.
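To illustrate, orthographic features can also be injected directly from Python instead of via the --spelling/--affix flags, using a second namespace in the example dict. A hedged sketch (the helper and feature names are illustrative, not part of pyvw beyond the dict form already used in _run):

```python
def ortho_features(word, affix_len=2):
    """Surface-form features that generalise beyond exact word identity."""
    # coarse word shape: letters -> 'a', digits -> 'd', anything else -> '.'
    shape = ''.join('d' if c.isdigit() else 'a' if c.isalpha() else '.'
                    for c in word)
    return ['pre=' + word[:affix_len],   # prefix, a crude stand-in for --affix
            'suf=' + word[-affix_len:],  # suffix
            'shape=' + shape]            # roughly what --spelling provides

# These could go in an extra namespace inside _run, e.g.
# self.vw.example({'w': [word], 'o': ortho_features(word)})
print(ortho_features('chennai'))
# ['pre=ch', 'suf=ai', 'shape=aaaaaaa']
```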
There are many parameters, e.g. -l (learning rate) and --passes. See the tutorial and the list of options.
Note that ring_size does not influence the prediction quality; you just need to set it high enough that you don't get any warnings (i.e. higher than the longest sequence).
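For example, a quick way to size ring_size from the data (a sketch with a hypothetical helper; assumes training_set is a list of token sequences as in the question):

```python
def safe_ring_size(sentences, margin=2):
    """Pick a ring_size comfortably above the longest sequence, to avoid VW's warnings."""
    return max(len(s) for s in sentences) * margin

training_set = [[(1, 'new'), (2, 'york'), (3, 'to')],
                [(3, 'on'), (3, 'sunday')]]
print(safe_ring_size(training_set))
# 6
```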
Upvotes: 4