Reputation: 7792
I'm trying to figure out what this Perl script does.
use FindBin qw($Bin);
use strict;
use Encode;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
chdir($Bin);
my $dir = "txt";
my $outdir = "aligned";
my $preprocessor = "$Bin/tools/split-sentences.perl -q";
my ($l1,$l2) = @ARGV;
die unless -e "$dir/$l1";
die unless -e "$dir/$l2";
`mkdir -p $outdir/$l1-$l2/$l1`;
`mkdir -p $outdir/$l1-$l2/$l2`;
my ($dayfile,$s1); # globals for reporting reasons
open(LS,"ls $dir/$l1|");
while($dayfile = <LS>) {
  chop($dayfile);
  if (! -e "$dir/$l2/$dayfile") {
    print "$dayfile only for $l1, not $l2, skipping\n";
    next;
  }
  &align();
}
From looking at this I need to run
perl sentence-align-corpus.perl europarlEnglishCorpus.txt europarlSpanishCorpus.txt
where those two files are in a txt folder.
Running the above gives me
txt/europarlEnglishCorpus.txt only for europarlEnglishCorpus.txt, not europarlSpanishCorpus.txt, skipping
It doesn't align any sentences; it just creates the directories. It looks like the if block inside the loop is being triggered, but I'm not sure what it checks.
What does this script do?
Upvotes: 2
Views: 330
Reputation: 3631
The program assumes the following input layout in the same directory as the script:
txt/
    lang-a/
        day-1
        day-2
    lang-b/
        day-1
        day-2
    lang-c/
        day-1
        day-2
and you then run it as
./sentence-align-corpus.perl lang-a lang-b
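A minimal sketch of that layout, built in a scratch directory so nothing real is touched. The language directories en and es and the file name day-1 are hypothetical names picked for illustration, not names the script requires:

```shell
# Build a throwaway tree that mirrors the layout the script expects.
# "en", "es" and "day-1" are hypothetical names chosen for illustration.
work=$(mktemp -d)
mkdir -p "$work/txt/en" "$work/txt/es"
printf 'Hello.\n' > "$work/txt/en/day-1"
printf 'Hola.\n'  > "$work/txt/es/day-1"

# The two checks at the top of the script (-e "txt/$l1", -e "txt/$l2")
# would pass, and the per-file check would find day-1 on both sides.
layout_ok=no
if [ -d "$work/txt/en" ] && [ -d "$work/txt/es" ] && [ -e "$work/txt/es/day-1" ]; then
    layout_ok=yes
fi
echo "layout_ok=$layout_ok"
```

With a tree like this next to the script, the matching invocation would be perl sentence-align-corpus.perl en es.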
I assume that the files mentioned at http://www.statmt.org/europarl/ under Download might be of interest.
There are pointers on this website. These may or may not be helpful, but I would expect you to have read these before asking SO for help.
For a detailed description of this corpus, please read:
Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.
Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).
I stick to my original suggestion: email the address given on the website and ask for better instructions on what else (if anything) you need to download, how to run the script, and what it aims to achieve.
Upvotes: 3
Reputation: 126742
The command-line parameters are directories. The program expects to find files in txt/p1 and txt/p2 (where p1 and p2 are the parameters passed).
It checks all the files in txt/p1 and either prints the message you see if there is no file of the same name in txt/p2, or calls the align subroutine.
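The check described above can be sketched in shell against a scratch tree. This is only an illustration of the loop's logic, not the script itself: p1, p2 and the day-* file names are hypothetical, and a glob stands in for the script's piped ls:

```shell
# Shell rendition of the script's main loop, run against a scratch tree.
# p1, p2 and the day-* file names are hypothetical.
work=$(mktemp -d)
p1=en
p2=es
mkdir -p "$work/txt/$p1" "$work/txt/$p2"
touch "$work/txt/$p1/day-1" "$work/txt/$p1/day-2" "$work/txt/$p2/day-1"

report=""
for path in "$work/txt/$p1"/*; do
    dayfile=$(basename "$path")
    if [ ! -e "$work/txt/$p2/$dayfile" ]; then
        # Same wording as the script's skip message.
        line="$dayfile only for $p1, not $p2, skipping"
    else
        # The real script calls its align subroutine at this point.
        line="$dayfile would be aligned"
    fi
    echo "$line"
    report="$report$line;"
done
```

Here day-2 exists only on the p1 side, so it is the one that gets skipped.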
You are presumably getting the results you see because there is a file txt/europarlEnglishCorpus.txt but not one at txt/europarlSpanishCorpus.txt/europarlEnglishCorpus.txt.
The confusion arises because the program lists the directories by shelling out to ls, which will take either a filename or a directory name as its parameter.
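A small demonstration of that ls behaviour, run in a scratch directory with hypothetical names (en, day-1, corpus.txt). Given a directory, ls prints the entries inside it; given a plain file, it prints the path exactly as it was passed, which is why the skip message in the question begins with txt/:

```shell
# Compare what ls prints for a directory argument and for a file argument.
# All names below are hypothetical and live in a throwaway directory.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p txt/en
touch txt/en/day-1 txt/corpus.txt

dir_listing=$(ls txt/en)           # lists the directory's contents
file_listing=$(ls txt/corpus.txt)  # echoes the path as given

echo "directory argument: $dir_listing"
echo "file argument: $file_listing"
```

So when a file is passed where the script expects a directory, the loop still runs once, with the file's own path as the only "day file".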
Beyond this I cannot help you.
Upvotes: 2
Reputation: 15121
It looks like the second argument you give to this script (i.e. europarlSpanishCorpus.txt) is wrong; the script expects it to be a directory under the directory named txt.
Upvotes: 1