faz

Reputation: 313

How do I use Mallet for my sequence labeling task?

I am trying to incorporate the Mallet package into my Java code for a sequence labeling task. However, I am not sure how to do it with only the data import guide on the Mallet website. Can anybody help me with this?

My first question is about importing sequence data. The only data format I see on the website is InstanceList, but how do we describe sequences with that data structure? For example, suppose we have multiple sequences (A, B, and C are the labels): S1: A B B B B A B B; S2: B A B B B C; S3: C B A B B B. How should I put them into the training data? Should there be one InstanceList for S1, one for S2, and one for S3? And how do I then combine them into a single training set?

My second question is about how to set the features on the instances. I already have the feature weights and the labels, so is there an easy way to build the instances? For example, if an item in the sequence has features [0.1, 0.2, 0.5, 0.4, 0.1] and the label B, how can I put the features into the Instance structure without going through a multi-step pipeline?

In addition, I plan to use a CRF model for my sequence labeling task. Besides the labels, I would also like to get the probability of the whole sequence. Is that possible? I saw something like this on the website:

    double logScore = new SumLatticeDefault(crf,inputSeq,outputSeq).getTotalWeight();
    double logZ = new SumLatticeDefault(crf,inputSeq).getTotalWeight();
    double prob = Math.exp(logScore - logZ);

Does this do what I want? If so, what should inputSeq and outputSeq be here?

Upvotes: 2

Views: 709

Answers (1)

David Mimno

Reputation: 1901

The standard input format for a sequence tagging task is one token per line, with the label as the last item on each line and sequences separated by a blank line:

    feature1 feature2 feature3 ... A
    feature2 feature4 feature6 ... B

    feature1 feature3 feature8 ... C
    feature2 feature3 feature4 ... C
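As a rough illustration of that layout, here is a minimal sketch in plain Java (no Mallet classes) that renders labeled sequences as text in this format. The `Token` class, the `format` method, and the feature names are hypothetical helpers made up for the example; the features are treated as binary, so only the names of the active ones are written out.

```java
import java.util.Arrays;
import java.util.List;

class SequenceFormatSketch {
    /** Hypothetical helper: one token with its active (binary) feature names and its label. */
    static final class Token {
        final List<String> features;
        final String label;

        Token(List<String> features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Renders sequences as one token per line, with a blank line between sequences. */
    static String format(List<List<Token>> sequences) {
        StringBuilder out = new StringBuilder();
        for (int s = 0; s < sequences.size(); s++) {
            if (s > 0) {
                out.append('\n'); // blank line separates consecutive sequences
            }
            for (Token token : sequences.get(s)) {
                for (String feature : token.features) {
                    out.append(feature).append(' ');
                }
                out.append(token.label).append('\n'); // label is the last item on the line
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> first = Arrays.asList(
            new Token(Arrays.asList("feature1", "feature2"), "A"),
            new Token(Arrays.asList("feature2", "feature4"), "B"));
        List<Token> second = Arrays.asList(
            new Token(Arrays.asList("feature1", "feature3"), "C"));
        System.out.print(format(Arrays.asList(first, second)));
    }
}
```

All sequences go into the same output, so S1, S2, and S3 from the question would simply be three blocks of lines separated by blank lines.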

In most cases CRF features are assumed to be binary. If you have features with known values you may need to write some additional code. The class SvmLight2FeatureVectorAndLabel may be useful.

The inputSeq variable should be a FeatureVectorSequence and the outputSeq variable should be a LabelSequence. These are the Data and Target fields of an Instance, respectively.
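To make the arithmetic in the question's snippet more concrete, here is a toy sketch in plain Java of what `Math.exp(logScore - logZ)` means for a linear-chain model. It is not Mallet's SumLatticeDefault; the emission and transition score tables, the class name, and the method names are all assumptions invented for the example. `pathLogScore` plays the role of the lattice constrained to one output sequence, and `logPartition` plays the role of the unconstrained lattice.

```java
class SequenceProbabilitySketch {
    /** Unnormalized log score of one label path: emission scores plus transition scores. */
    static double pathLogScore(double[][] emit, double[][] trans, int[] labels) {
        double score = 0.0;
        for (int t = 0; t < labels.length; t++) {
            score += emit[t][labels[t]];
            if (t > 0) {
                score += trans[labels[t - 1]][labels[t]];
            }
        }
        return score;
    }

    /** log Z: log of the summed exponentiated scores of all label paths (forward algorithm). */
    static double logPartition(double[][] emit, double[][] trans) {
        int numLabels = trans.length;
        double[] alpha = emit[0].clone(); // assumes at least one position
        for (int t = 1; t < emit.length; t++) {
            double[] next = new double[numLabels];
            for (int y = 0; y < numLabels; y++) {
                double[] terms = new double[numLabels];
                for (int prev = 0; prev < numLabels; prev++) {
                    terms[prev] = alpha[prev] + trans[prev][y];
                }
                next[y] = logSumExp(terms) + emit[t][y];
            }
            alpha = next;
        }
        return logSumExp(alpha);
    }

    /** Numerically stable log(sum(exp(x))). */
    static double logSumExp(double[] xs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            max = Math.max(max, x);
        }
        double sum = 0.0;
        for (double x : xs) {
            sum += Math.exp(x - max);
        }
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        // Made-up scores: 3 positions, 2 labels.
        double[][] emit = {{1.0, 0.0}, {0.5, 0.5}, {0.0, 2.0}};
        double[][] trans = {{0.2, -0.1}, {0.0, 0.3}};
        int[] labels = {0, 1, 1};
        double logScore = pathLogScore(emit, trans, labels);
        double logZ = logPartition(emit, trans);
        System.out.println("P(labels | input) = " + Math.exp(logScore - logZ));
    }
}
```

Because logZ sums over every possible label path, the probabilities of all paths for a given input add up to 1, which is why the ratio is a proper probability for the whole sequence.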

Upvotes: 1
