praks5432

Reputation: 7792

Perl script for aligning corpora

I'm trying to figure out what this Perl script does.

use FindBin qw($Bin);
use strict;
use Encode;

binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

chdir($Bin);
my $dir = "txt";
my $outdir = "aligned";
my $preprocessor = "$Bin/tools/split-sentences.perl -q";

my ($l1,$l2) = @ARGV;
die unless -e "$dir/$l1";
die unless -e "$dir/$l2";

`mkdir -p $outdir/$l1-$l2/$l1`;
`mkdir -p $outdir/$l1-$l2/$l2`;

my ($dayfile,$s1); # globals for reporting reasons
open(LS,"ls $dir/$l1|");
while($dayfile = <LS>) {
  chop($dayfile);
  if (! -e "$dir/$l2/$dayfile") {
    print "$dayfile only for $l1, not $l2, skipping\n";
    next;
  }
  &align();
}

From reading it, I believe I need to run

perl sentence-align-corpus.perl europarlEnglishCorpus.txt europarlSpanishCorpus.txt

where those two files are in a txt folder.

Running the above gives me

txt/europarlEnglishCorpus.txt only for europarlEnglishCorpus.txt, not europarlSpanishCorpus.txt, skipping

It doesn't align any sentences; it just creates directories. It looks like the if check inside the loop is being triggered, but I'm not sure what it tests.

What does this script do?

Upvotes: 2

Views: 330

Answers (3)

justintime

Reputation: 3631

The program assumes the following input layout in the same directory as the script:

txt/
  lang-a/
     day-1
     day-2
  lang-b/
     day-1
     day-2
  lang-c/
     day-1
     day-2

and you then run it as

./sentence-align-corpus.perl lang-a lang-b
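As a minimal sketch of that layout, the commands below build it in a scratch directory and repeat the existence checks the script makes on its two arguments. The language names en and es, the file name day-1, and the file contents are placeholders chosen for illustration; they are not taken from the Europarl distribution.

```shell
# Hypothetical language directories and day file, for illustration only.
work=$(mktemp -d)
mkdir -p "$work/txt/en" "$work/txt/es"
printf 'Hello.\n' > "$work/txt/en/day-1"
printf 'Hola.\n'  > "$work/txt/es/day-1"

# The script dies unless txt/<arg> exists for both arguments;
# these checks mirror that guard.
for lang in en es; do
  if [ -e "$work/txt/$lang" ]; then
    echo "txt/$lang exists"
  fi
done

# List every file in the layout, relative to the scratch directory.
layout=$(cd "$work" && find txt -type f | sort)
echo "$layout"
```

With a layout like this next to the script, the arguments would be en and es rather than file names.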

I assume that the files mentioned at http://www.statmt.org/europarl/ under Download might be of interest.

There are pointers on this website. These may or may not be helpful, but I would expect you to have read these before asking SO for help.

For a detailed description of this corpus, please read:

  • Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

  • Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).

I stick to my original suggestion: email the address given on the website and ask for better instructions as to what else (if anything) you need to download, how to run it, and what it aims to achieve.

Upvotes: 3

Borodin

Reputation: 126742

The command-line parameters are directories. The program expects to find files in txt/p1 and txt/p2 (where p1 and p2 are the parameters passed).

It checks every file in txt/p1 and either prints the message you see if there is no file of the same name in txt/p2, or calls the align subroutine.
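The pairing check can be tried on its own as a dry run. This is only a sketch of the loop's logic in shell, not the script itself: the directory names en and es and the day files are hypothetical, and the "would be aligned" message stands in for the call to align.

```shell
# Dry run of the pairing check. All names here are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/txt/en" "$work/txt/es"
touch "$work/txt/en/day-1" "$work/txt/en/day-2" "$work/txt/es/day-1"

report=$(
  cd "$work" || exit 1
  for path in txt/en/*; do
    dayfile=${path##*/}          # strip the directory part, keep the file name
    if [ -e "txt/es/$dayfile" ]; then
      echo "$dayfile would be aligned"
    else
      echo "$dayfile only for en, not es, skipping"
    fi
  done
)
echo "$report"
```

Only files present under both directories reach the alignment step; anything else produces the "skipping" line.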

You are presumably getting the results you see because there is a file txt/europarlEnglishCorpus.txt but not one at txt/europarlSpanishCorpus.txt/europarlEnglishCorpus.txt.

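The confusion arises because the program lists the directory by shelling out to ls, which accepts either a file name or a directory name as its parameter. Given a file name, ls simply prints that path back, which is why the message starts with txt/europarlEnglishCorpus.txt.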
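The difference is easy to see in a scratch directory. In this sketch the directory txt/en and the file day-1 are hypothetical; the file name europarlEnglishCorpus.txt is the one from the question.

```shell
work=$(mktemp -d)
mkdir -p "$work/txt/en"
touch "$work/txt/en/day-1" "$work/txt/europarlEnglishCorpus.txt"

# Given a directory, ls prints the names of the entries inside it.
dir_listing=$(cd "$work" && ls txt/en)

# Given a regular file, ls prints the argument itself, path included.
file_listing=$(cd "$work" && ls txt/europarlEnglishCorpus.txt)

echo "directory: $dir_listing"
echo "file:      $file_listing"
```

In the second case the loop sees a single "file" named txt/europarlEnglishCorpus.txt, looks for it under the second argument, and does not find it.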

Beyond this I cannot help you.

Upvotes: 2

Lee Duhem

Reputation: 15121

It looks like the second argument (i.e. europarlSpanishCorpus.txt) you give to this script is wrong: the script expects it to be a directory under the directory named txt.

Upvotes: 1
