user3650185

Reputation: 33

Sentence segmentation/tokenization with Perl

I am trying to segment a large text into sentences. The University of Illinois offers a Perl script that splits text into sentences. I don't know how accurate it is, so I want to try it out.

I have downloaded the script, and the command-line invocation appears to work, but it does not produce the expected result: the output file is identical to the input file. According to its documentation, the program detects sentence boundaries and writes a text file in which each line corresponds to one sentence.

I am a PHP developer and not well versed in Perl, so can anybody with Perl programming knowledge figure out where the problem lies?

This is the command line I am using (I have renamed the script to boundary.pl):

perl.exe boundary.pl -d HONORIFICS -i input.txt -o output.txt

Upvotes: 0

Views: 439

Answers (1)

Kim Ryan

Reputation: 515

There is a Perl module on CPAN, the widely used Perl module archive, that does this: http://search.cpan.org/~kimryan/Lingua-EN-Sentence-0.29/lib/Lingua/EN/Sentence.pm . You can install it with the 'cpan' command-line utility that comes with Perl.

You would need to add a small amount of code to create the output of split sentences, but the synopsis shows you most of what you need.
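As a rough sketch of that extra code, something like the following should write one sentence per line. It relies on the module's get_sentences function, which returns a reference to an array of sentences. The file names input.txt and output.txt are assumptions taken from the command line in the question, and the step that collapses line breaks inside a sentence is my own addition, not something the module requires.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::Sentence qw(get_sentences);

# Assumed file names, borrowed from the question's command line.
my ($in_file, $out_file) = ('input.txt', 'output.txt');

# Read the whole input file into a single string.
open my $in, '<', $in_file or die "Cannot open $in_file: $!";
my $text = do { local $/; <$in> };
close $in;

# get_sentences returns a reference to an array of sentences.
my $sentences = get_sentences($text);

# Write each sentence on its own line.
open my $out, '>', $out_file or die "Cannot open $out_file: $!";
for my $sentence (@{ $sentences || [] }) {
    # Collapse line breaks inside a sentence so it stays on one output line.
    $sentence =~ s/\s*\n\s*/ /g;
    print {$out} "$sentence\n";
}
close $out;
```

Save it under any name (for example split.pl, a hypothetical name) in the same folder as input.txt and run it with: perl.exe split.pl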

Upvotes: 1
