Karthick R

Reputation: 619

how to train a maxent classifier

[Project stack: Java, OpenNLP, Elasticsearch (datastore), twitter4j to read data from Twitter]

I intend to use a maxent classifier to classify tweets. I understand that the initial step is to train the model. From the documentation I found that there is a GISTrainer-based train method for this. I have managed to put together a simple piece of code which uses OpenNLP's maxent classifier to train the model and predict the outcome.

I have used two files, positive.txt and negative.txt, to train the model.

Contents of positive.txt

positive    This is good
positive    This is the best
positive    This is fantastic
positive    This is super
positive    This is fine 
positive    This is nice

Contents of negative.txt

negative    This is bad
negative    This is ugly
negative    This is the worst
negative    This is worse
negative    This sucks

The Java methods below train the model and generate the outcome.

@Override
public void trainDataset(String source, String destination) throws Exception {
    File[] inputFiles = FileUtil.buildFileList(new File(source)); // picks up both positive.txt and negative.txt
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;
    int iterations = 100;
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff, iterations, bowfg);
    model.serialize(new FileOutputStream(modelFile));
}

@Override
public void predict(String text, String modelFile) {
    InputStream modelStream = null;
    try {
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        modelStream = new FileInputStream(modelFile);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        double[] probs = categorizer.categorize(tokens);
        if (probs != null && probs.length > 0) {
            for (int i = 0; i < probs.length; i++) {
                System.out.println("double[] probs index  " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (modelStream != null) {
            try {
                modelStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

public static void main(String[] args) {
    try {
        String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
        String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("This is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

I have the following questions.

1) How do I iteratively train a model? Also, how do I add new sentences/words to the model? Is there a specific format for the data file? I found that the file needs to have a minimum of two words separated by a tab. Is my understanding correct?

2) Are there any publicly available data sets that I can use to train the model? I found some sources for movie reviews, but the project I'm working on involves not just movie reviews but also other things such as product reviews and brand sentiment.

3) This helps to an extent. Is there a working example publicly available somewhere? I couldn't find the documentation for maxent.

Please help me out; I'm kind of blocked on this.

Upvotes: 2

Views: 1705

Answers (2)

Viliam Simko

Reputation: 1841

AFAIK, you have to completely retrain a MaxEnt model if you want to add new training samples. It cannot be done incrementally online.

The default input format for OpenNLP maxent is a text file where each line represents a single sample. A sample is composed of tokens (features) delimited by whitespace. During training, the first token represents the outcome.
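For illustration only, here is a minimal sketch of training a document categorizer from a single file in that format. It assumes OpenNLP 1.5.x (the same DoccatModel/DocumentCategorizerME API the question uses) and a hypothetical file name tweets.train; it is not the asker's code.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainFromSingleFile {
    // tweets.train (hypothetical) holds one sample per line, first token is the outcome:
    //   positive This is good
    //   negative This is bad
    public static void main(String[] args) throws Exception {
        InputStream dataIn = new FileInputStream("tweets.train");
        try {
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
            // cutoff 1 so features from a tiny toy data set are not filtered out; 100 iterations
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, 1, 100);
            model.serialize(new FileOutputStream("tweet-sentiment.bin"));
        } finally {
            dataIn.close();
        }
    }
}

To take new samples into account, you append them to the file (or your datastore) and rerun this full training step, since the model itself cannot be updated in place.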

Take a look at my minimal working example here: Training models using openNLP maxent

Upvotes: 0

Mark Giaconia

Reputation: 3953

1) You can store the samples in a database. I used Accumulo for this once. Then at some interval you rebuild the model and reprocess your data.

2) The format is: category name, a space, then the sample, then a newline. No tabs (see the example after this list).

3) It sounds like you want to combine general sentiment with a topic or entity. You could use a name finder or just a regex to find the entity, or add the entity to your class labels for the doccat (include a product name, etc.), but then your samples would have to be very specific.
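For example, a single training file in that space-separated format would look like this (labels and sentences are only illustrative):

positive This is good
positive This is the best
negative This is bad
negative This sucks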

Upvotes: 0
