WannabePerlExpert

Reputation: 19

Parsing one sentence at a time with the Stanford Parser

I have a text file with about 6000 sentences, each on its own line. I want to use the Stanford Parser from the Windows command prompt to parse the sentences. However, I need to send one sentence at a time to the parser (because the sentences are aligned with the sentences of another file).

I want to write a Perl wrapper that does the following: write one sentence from the input file to a temp file, send the temp file to the parser, parse that one sentence, write the parsed output to an output file, and append that output to my big output file, ParsedOutput.txt.

This is probably a very basic thing to do, but I’m stuck. Any help or guidance would really be appreciated.

Thank you! :)

Edited: This is what I've tried so far:

open (ENGDATA, "<1tot1660.txt");
open (ENGDATAOUT, ">temp.txt");
while (<ENGDATA>)
{
my $line = $_;
chomp $line;    
while ($line)
    {
    my @OneLine = $line;
    print ENGDATAOUT "$OneLine[0]\n";
    shift(@OneLine);
    }
}

I was thinking: have each line as an element in an array, write the 0th element to the temp output file, and then remove the first element (so that it won't accidentally be used again). I am basically stuck with the whole program, but for the moment I am stuck at writing one line at a time to the temp output file.

EDIT! (again.. Thanks, TLP and amon! :) ) This is what I eventually did:

open (ENGDATA, "<Testing10.txt");
open (ENGDATAOUT, ">TempOut.txt");
open (PARSEDOUT, ">ParsedOutput.txt");

while (<ENGDATA>)    
{
    my $line = $_;
    chomp $line;
    my $inputfilename = $line;
    print ENGDATAOUT "$line\n";

    my $parsecommand = qx(java -mx150m -cp "*;" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $inputfilename);

    print PARSEDOUT "$parsecommand\n";
}

I now get this error for every word in my input:

Parsing file: superior
edu.stanford.nlp.process.DocumentPreprocessor: Could not open path superior
Parsed file: superior [0 sentences].

What's this all about? Does anyone know and could you maybe help, please? Thanks!

Upvotes: 0

Views: 819

Answers (2)

tripleee

Reputation: 189597

Just for the heck of it, a shell script version.

while read -r; do
    printf '%s\n' "$REPLY" >tmp
    parser -input tmp -output tmp2
    cat tmp2
done <input >output
rm tmp tmp2

If the parser can be made to read from standard input and write results to standard output, this can be simplified significantly. On Linux you could use /dev/fd/0 and /dev/fd/1 if it insists on file name arguments.

printf '%s\n' "$REPLY" |
parser -input /dev/fd/0 -output /dev/fd/1

and do away with the temporary files completely.

Upvotes: 1

amon

Reputation: 57640

Ok. Your code seems to be copying the file 1tot1660.txt to temp.txt, introducing some very interesting syntax on the way ;-)

I would do it like this:

  1. Declare all the filenames in one place:

    #!/usr/bin/perl
    use strict; use warnings;
    my $BigInFile     = ...;
    my $BigOutFile    = ...;
    my $ParserInFile  = ...;
    my $ParserOutFile = ...;
    
  2. Open the Big files, and start looping over the input:

    open my $BigIn,  '<', $BigInFile  or die "Can't open $BigInFile: $!";
    open my $BigOut, '>', $BigOutFile or die "Can't open $BigOutFile: $!";
    while (defined(my $line = <$BigIn>)) {
        print $BigOut doStanford($line);
    }
    

    We put each line of the Big Input File into $line while it is defined (read: while we don't have the EOF). Then we print the output of the subroutine doStanford to the Big Output File, assuming it already has an ending newline. If not, feel free to write the code to append it.

  3. Write the subroutine doStanford. We take a line, write it to the temp file, invoke the program, then read the other temp file and return its contents:

    sub doStanford {
        my ($line) = @_; # unpack arguments
    
        # open the first temp file (the parser's input):
        open my $StanfordIn, '>', $ParserInFile
          or die "Couldn't open $ParserInFile";
        print $StanfordIn $line; # already has newline
        close $StanfordIn;
    
        # do the call to the parser. I don't know the interface
        # so I assume it is "parser --in INFILE --out OUTFILE"
        my $returnValue = system("parser",
          "--in", $ParserInFile,
          "--out", $ParserOutFile);
        if ($returnValue != 0) {
            # an error occurred; the exit status is in the high byte of $?
            die "The Parser exited with status ", $? >> 8, ": $!\n";
        }
    
        # read in the other file, and return:
        open my $StanfordOut, '<', $ParserOutFile
          or die "Couldn't open $ParserOutFile";
        my $parsed = <$StanfordOut>; # we only want the first line
        return $parsed;
        # implicit close $StanfordOut
    }
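
For the missing-newline case mentioned in step 2, a minimal sketch could look like this. The name ensure_newline is a hypothetical helper of mine, not something the parser or Perl provides, and it assumes that an empty or missing result should still produce one output line so the two files stay aligned.

```perl
use strict;
use warnings;

# Hypothetical helper: make sure the text ends with a newline.
sub ensure_newline {
    my ($text) = @_;
    $text = '' unless defined $text;          # missing parser output becomes an empty line
    $text .= "\n" unless $text =~ /\n\z/;     # append a newline only if none is there
    return $text;
}
```

In the loop this would be used as `print $BigOut ensure_newline(doStanford($line));`.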
    

There may ;-) be some typos in here, so better write it yourself.

I added some error handling around the system call as good practice. A return value of 0 indicates success; non-zero values (especially -1) indicate an error or abnormal termination.

If the Parser can output to STDOUT instead of a file, you can execute the command inside qx{}:

my $parsed = qx{parser --in INFILE};

That way we don't need extra files, and error handling is limited to checking $? after the call.
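
Note that qx{} sets $? just as system does, so a basic check remains possible. Here is a minimal sketch of that idea. The name capture_or_die is hypothetical, and because I don't know the parser's interface, the running Perl interpreter ($^X) is used as a stand-in for the parser command.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command line, return its STDOUT, die on failure.
sub capture_or_die {
    my ($command) = @_;
    my $output = qx{$command};      # captures STDOUT only; STDERR still goes to the terminal
    if ($? != 0) {
        die "Command failed (wait status $?): $command\n";
    }
    return $output;
}

# Stand-in for the parser: the Perl interpreter prints a short string.
my $parsed = capture_or_die(qq{"$^X" -e "print 42"});
print "$parsed\n";
```

For the real parser, the argument would be the full command line, with the temp file name (not the sentence itself) as the last argument.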

Inside the system call, I split the arguments into a list. If we supplied a single string instead, it would be split at every space, which breaks pathnames that contain spaces. Passed as a list, they are safe.
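
A small sketch of the list form with a file name that contains spaces. The file name is made up for illustration, and the Perl interpreter ($^X) again stands in for the parser: it exits with 0 only if it received the name as one single argument.

```perl
use strict;
use warnings;

# A made-up file name containing spaces, as is common on Windows.
my $file = 'my input file.txt';

# Stand-in for the parser: exit 0 only if exactly one argument arrived
# and it equals the file name.
my $check = 'exit(@ARGV == 1 && $ARGV[0] eq q{my input file.txt} ? 0 : 1)';

# List form: each element reaches the program as one argument.
my $status = system($^X, '-e', $check, $file);
print "list form kept the name intact\n" if $status == 0;
```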

If you can use a module for this, use the module. It is safer and easier.

Edits

  • The return value of system isn't actually the exit status of the command called. The return value is just 0 when the command succeeded, and true on error. The exit status is the value of the expression $? >> 8. $! may be set to a reason.
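
To make that concrete, here is a minimal sketch of decoding the status, following the pattern shown in the documentation for system. The name describe_status is a hypothetical helper, not a built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a wait status (the value of $? after system
# or qx) into a readable message. Pass $! as the second argument so the
# -1 case can report why the command could not be started.
sub describe_status {
    my ($status, $os_error) = @_;
    if ($status == -1) {
        return "failed to start: " . (defined $os_error ? $os_error : 'unknown error');
    }
    if ($status & 127) {
        return sprintf "killed by signal %d", $status & 127;
    }
    return sprintf "exited with status %d", $status >> 8;
}
```

After a call such as `system(...)`, this would be used as `die describe_status($?, $!) if $? != 0;`.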

Upvotes: 0
