Anuj Gupta

Reputation: 6562

Learning2Search (vowpal-wabbit) for NER gives weird results

We are trying to use Learning2Search from vowpal-wabbit for NER, using the ATIS dataset.

ATIS has 127 entities (including the "Others" category). The training set has 4,978 sentences and the test set has 893.

However, when we run it on the test set it maps everything to either class 1 (airline name) or class 2 (airport code), which is weird.


We tried another dataset (https://github.com/glample/tagger/tree/master/dataset) and saw the same behavior.

Looks like I am not using it the right way. Any pointers will be of great help.

Code snippet :

# Python 2 code (cPickle, xrange, print statements)
import cPickle
import random

# the import path for VW's Python bindings may differ by install
from vowpalwabbit import pyvw

with open("/tweetsdb/ner/datasets/atis.pkl") as f:
    train, test, dicts = cPickle.load(f)

idx2words = {v: k for k, v in dicts['words2idx'].iteritems()}
idx2labels = {v: k for k, v in dicts['labels2idx'].iteritems()}
idx2tables = {v: k for k, v in dicts['tables2idx'].iteritems()}


# Convert the dataset into a format compatible with Vowpal Wabbit:
# each sentence becomes a list of (label_index, word_string) pairs
training_set = []
for i in xrange(len(train[0])):
    zip_label_ent_idx = zip(train[2][i], train[0][i])
    label_ent_actual = [(int(l), idx2words[w]) for l, w in zip_label_ent_idx]
    training_set.append(label_ent_actual)
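For reference, here is a minimal, self-contained sketch of that same conversion on a toy one-sentence "dataset" laid out like the ATIS pickle (`train[0]` holds word indices, `train[2]` holds label indices). The toy words and labels below are illustrative stand-ins, not actual ATIS content:

```python
# Toy stand-ins for dicts['words2idx'] and the train tuple from the pickle
toy_words2idx = {'flights': 0, 'to': 1, 'boston': 2}
idx2words = {v: k for k, v in toy_words2idx.items()}

# (word index lists, _, label index lists) -- same layout as the ATIS train tuple
toy_train = ([[0, 1, 2]], None, [[0, 0, 1]])

training_set = []
for words, labels in zip(toy_train[0], toy_train[2]):
    # Each sentence becomes a list of (label_index, word_string) pairs,
    # which is the shape SequenceLabeler._run expects below.
    training_set.append([(int(l), idx2words[w]) for l, w in zip(labels, words)])

print(training_set[0])  # [(0, 'flights'), (0, 'to'), (1, 'boston')]
```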


# Do like wise to get test chunk

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)

        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )

    def _run(self, sentence):   
        output = []
        for n in range(len(sentence)):
            pos,word = sentence[n]

            with self.vw.example({'w': [word]}) as ex:
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=[(n,'p'), (n-1, 'q')])
                output.append(pred)
        return output

vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024")

Code for training the model:

#Training
sequenceLabeler = vw.init_search_task(SequenceLabeler)
for i in xrange(3):
    sequenceLabeler.learn(training_set[:10])

Code for Prediction:

for i in random.sample(xrange(len(test_set)), 10):
    test_example = [(999, word[1]) for word in test_set[i]]
    test_labels  = [label[0] for label in test_set[i]]
    print 'input sentence:', ' '.join([word[1] for word in test_set[i]])
    print 'actual labels:', ' '.join([str(label) for label in test_labels])
    print 'predicted labels:', ' '.join([str(p) for p in sequenceLabeler.predict(test_example)])

To see the full code, please refer to this notebook: https://github.com/nsanthanam/ner/blob/master/vowpal_wabbit_atis.ipynb

Upvotes: 2

Views: 595

Answers (1)

acepor

Reputation: 41

I am also new to this algorithm, but I did some pilot studies recently.

Your problem is that you set a wrong parameter in

vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024")

Here, `--search` should be set to 127; that way, vw will use all 127 of your tags:

vw = pyvw.vw("--search 127 --search_task hook --ring_size 1024")
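A small sketch (mine, not from the original post) of deriving the action count from the label dictionary instead of hard-coding it. With the real pickle this would be `len(dicts['labels2idx'])`; the stand-in dictionary below just mimics ATIS's 127 labels:

```python
# Stand-in for dicts['labels2idx'] from the question's pickle, with 127
# entries like ATIS; the label names are hypothetical.
labels2idx = {'label_%d' % i: i for i in range(127)}
num_labels = len(labels2idx)

# Build the argument string from the label count rather than a literal
vw_args = "--search %d --search_task hook --ring_size 1024" % num_labels
print(vw_args)  # --search 127 --search_task hook --ring_size 1024
# vw = pyvw.vw(vw_args)
```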

Also, my feeling is that vw doesn't work all that well with so many tags. I might be wrong, though; please let me know your results :)

Upvotes: 1
