Reputation: 313
I am trying to incorporate the MALLET package into my Java code for a sequence labeling task. However, I am not sure how to do it with just the data import guide on the MALLET website. Can anybody help me with this?
My first question is about importing sequence data. The only data format I see on the website is InstanceList, but how should sequences be described with that data structure? For example, suppose we have multiple sequences (A, B, C are the labels): S1: A B B B B A B B; S2: B A B B B C; S3: C B A B B B. How should I put them into the training data? One InstanceList for S1, one for S2, and one for S3? And then how do I put them all together as training data?
My second question is about how to set the features on the instances. I already have the feature weights and the labels, so is there an easy way to build the instances? For example, I have features [0.1, 0.2, 0.5, 0.4, 0.1] for an item in the sequence, and its label is B. How can I put these features into the Instance structure without going through a multi-stage pipeline?
I am also planning to use the CRF model for my sequence labeling task. In addition to the labels, I would like to get the probability of the whole sequence. Is it possible to get this information? I saw something like this on the website:
double logScore = new SumLatticeDefault(crf,inputSeq,outputSeq).getTotalWeight();
double logZ = new SumLatticeDefault(crf,inputSeq).getTotalWeight();
double prob = Math.exp(logScore - logZ);
Will this do what I want? If so, what would inputSeq and outputSeq be here?
Upvotes: 2
Views: 709
Reputation: 1901
The standard input format for a sequence tagging task is one token per line, with sequences separated by a blank line:
feature1 feature2 feature3 ... A
feature2 feature4 feature6 ... B
feature1 feature3 feature8 ... C
feature2 feature3 feature4 ... C
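To make the layout concrete, here is a minimal sketch in plain Java that reads text in this format into sequences of tokens. It does not use MALLET; `SequenceFileParser` and `Token` are hypothetical names introduced only for illustration, and the sketch assumes that the last whitespace-separated field on each line is the label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of MALLET: parses the one-token-per-line format. */
final class SequenceFileParser {

    /** One token: its feature names plus its label (assumed to be the last field). */
    static final class Token {
        final List<String> features;
        final String label;

        Token(List<String> features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Splits the text into sequences at blank lines; each non-blank line is one token. */
    static List<List<Token>> parse(String text) {
        List<List<Token>> sequences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : text.split("\\R", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the sequence collected so far.
                if (!current.isEmpty()) {
                    sequences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            String label = fields[fields.length - 1];
            List<String> features =
                new ArrayList<>(Arrays.asList(fields).subList(0, fields.length - 1));
            current.add(new Token(features, label));
        }
        // The last sequence may not be followed by a blank line.
        if (!current.isEmpty()) {
            sequences.add(current);
        }
        return sequences;
    }

    public static void main(String[] args) {
        String text = "feature1 feature2 A\nfeature2 feature4 B\n\nfeature1 feature3 C\n";
        for (List<Token> sequence : parse(text)) {
            StringBuilder labels = new StringBuilder();
            for (Token token : sequence) {
                labels.append(token.label).append(' ');
            }
            System.out.println(labels.toString().trim());
        }
    }
}
```

Each inner list corresponds to one sequence such as S1, S2, or S3 from the question, so all sequences live in one file rather than in separate containers.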
In most cases CRF features are assumed to be binary. If you have features with known values, you may need to write some additional code. The class SvmLight2FeatureVectorAndLabel may be useful.
The inputSeq variable should be a FeatureVectorSequence and the outputSeq variable should be a LabelSequence. These are the Data and Target fields of an Instance, respectively.
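To illustrate what the `Math.exp(logScore - logZ)` expression in the question computes, here is a small sketch of the same quantity for a toy linear-chain model in plain Java. This is not MALLET's implementation: `ToyChainProbability`, the `emit` table (score of each label at each position), and the `trans` table (score of moving from one label to the next) are assumptions made up for the example.

```java
/** Toy illustration, not MALLET code: p(labels | input) = exp(pathScore - logZ). */
final class ToyChainProbability {

    /** Unnormalized log-score of one label path: emission plus transition scores. */
    static double pathScore(double[][] emit, double[][] trans, int[] labels) {
        double score = emit[0][labels[0]];
        for (int t = 1; t < labels.length; t++) {
            score += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]];
        }
        return score;
    }

    /** log Z: log of the sum of exp(pathScore) over all label paths (forward recursion). */
    static double logPartition(double[][] emit, double[][] trans) {
        int numLabels = trans.length;
        double[] alpha = emit[0].clone();
        for (int t = 1; t < emit.length; t++) {
            double[] next = new double[numLabels];
            for (int y = 0; y < numLabels; y++) {
                double[] terms = new double[numLabels];
                for (int prev = 0; prev < numLabels; prev++) {
                    terms[prev] = alpha[prev] + trans[prev][y];
                }
                next[y] = logSumExp(terms) + emit[t][y];
            }
            alpha = next;
        }
        return logSumExp(alpha);
    }

    /** Numerically stable log(sum(exp(values))). */
    static double logSumExp(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        for (double v : values) {
            sum += Math.exp(v - max);
        }
        return max + Math.log(sum);
    }

    /** Probability of the whole label sequence under the toy model. */
    static double probability(double[][] emit, double[][] trans, int[] labels) {
        return Math.exp(pathScore(emit, trans, labels) - logPartition(emit, trans));
    }

    public static void main(String[] args) {
        double[][] emit = {{0.5, -0.2}, {0.1, 0.3}, {-0.4, 0.9}}; // 3 positions, 2 labels
        double[][] trans = {{0.2, -0.1}, {0.0, 0.6}};
        System.out.println(probability(emit, trans, new int[] {0, 1, 1}));
    }
}
```

Here the path score plays the role of logScore (the lattice constrained to one output sequence) and the log-partition plays the role of logZ (the unconstrained lattice), so the probabilities of all possible label sequences sum to 1.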
Upvotes: 1