nhershy

Reputation: 755

Tagging large files with Stanford's Part-Of-Speech Tagger

I am currently using Java and the IntelliJ IDE to run Stanford's POS tagger, set up following this tutorial: (http://new.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/). It runs correctly; however, it only outputs roughly two paragraphs' worth of text even when I give it much more content than that (my input file is 774 KB of text).

At the bottom of the tutorial it states this for memory problems:

It turns out that the problem is that Eclipse allocates only 256MB of memory by default. Right-click on the Project -> Run as -> Run Configurations -> go to the Arguments tab -> under VM arguments type -Xmx2048m. This will set the allocated memory to 2GB, and all the tagger files should run now.

I have configured IntelliJ to use 4GB of memory per this answer: How to increase IDE memory limit in IntelliJ IDEA on Mac?

Yet, it did not change the amount of outputted text in the slightest.

What else could be causing this to happen?

(link to original site of the POS tagger: https://nlp.stanford.edu/software/tagger.shtml)

EDIT:

I have pasted my Main class below. And TaggedWord is a class that helps me parse and organize the relevant pieces of data retrieved from the tagger.

package com.company;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Main {

    public static void main(String[] args) {

        File infile = new File("C:\\Users\\TEST\\Desktop\\input.txt");
        File outfile = new File("C:\\Users\\TEST\\Desktop\\output.txt");
        MaxentTagger tagger = new MaxentTagger("tagger/english-left3words-distsim.tagger");
        FileWriter fw;
        BufferedWriter bw;
        List<TaggedWord> taggedWords;

        try {
            //read in entire text file to String
            String fileContents = new Scanner(infile).useDelimiter("\\Z").next();

            //erase contents of outfile from previous run
            PrintWriter pw = new PrintWriter(outfile);
            pw.close();

            //tag file contents with parts of speech
            String fileContentsTagged = tagger.tagString(fileContents);

            taggedWords = processTaggedWords(fileContentsTagged);

            fw = new FileWriter(outfile, true); //true = append
            bw = new BufferedWriter(fw);

            String uasiContent = "";
            for (TaggedWord tw : taggedWords) {
                String englishWord = tw.getEng_word();
                String uasiWord = translate(englishWord);
                if (!tw.isPunctuation()) {
                    uasiContent += uasiWord + " ";
                }
                else {
                    //remove trailing space before punctuation (guard against empty string)
                    if (uasiContent.length() > 0) {
                        uasiContent = uasiContent.substring(0, uasiContent.length() - 1);
                    }
                    uasiContent += uasiWord + " ";
                }
            }
            bw.write(uasiContent);
            bw.close();
        }
        catch (FileNotFoundException e1) {
            System.out.println("File not found.");
        }
        catch (IOException e) {
            System.out.print("Error writing to file.");
        }
    }  //end main
}  //end class Main

EDIT2:

I have now modified the file-reading code to build the string with a while loop, but it still gives me the same results:

        //read in entire text file to String
        String fileContents = "";
        Scanner sc = new Scanner(infile).useDelimiter("\\Z");
        while (sc.hasNext()) {
            fileContents += sc.next();
        }

Upvotes: 1

Views: 477

Answers (1)

Adnan S

Reputation: 1882

Your Scanner's next() only gets called once, so it reads just the beginning of the input file. To continue, you need to declare the Scanner as a stand-alone variable and then iterate with a while loop on the hasNext() method. See the documentation and example here on declaring and iterating through a Scanner.
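As a minimal sketch of that pattern (the file name and class name here are made up for illustration, and the sketch writes its own sample file so it is self-contained), the idea is to keep calling the Scanner in a loop until the whole file has been consumed, rather than relying on a single next() call:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

public class ReadWholeFile {

    // Read an entire text file by iterating the Scanner, not via one next() call.
    static String readAll(File file) throws FileNotFoundException {
        StringBuilder sb = new StringBuilder();
        Scanner sc = new Scanner(file);
        while (sc.hasNextLine()) {          // loop until the whole file is consumed
            sb.append(sc.nextLine()).append('\n');
        }
        sc.close();
        return sb.toString();
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Write a small sample file so the sketch runs on its own (path is hypothetical).
        File f = new File("scanner-demo.txt");
        PrintWriter pw = new PrintWriter(f);
        pw.println("first line");
        pw.println("second line");
        pw.close();

        String contents = readAll(f);
        System.out.print(contents);
    }
}
```

A StringBuilder is used for the accumulation since repeated String concatenation inside a loop creates a new String object on every pass.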

Upvotes: 1
