Reputation: 277

Need help executing Perl tokenizing script

I'm a Perl amateur. Recently I was given a Perl script that takes a text file and removes all formatting, leaving only the individual words, each followed by a space. The problem is that it's unclear how to give the script an input file location. I've written some code to run it over an entire directory of files, but I haven't been able to get it to execute yet. I'll post the original code followed by what I added. Thanks for the help!

Original:

while(<>) {
    chomp;
    s/\<[^<>]*\>//g;           # eliminate markup
    tr/[A-Z]/[a-z]/;           # downcase

     s/([a-z]+|[^a-z]+)/\1 /g;  # separate letter strings from other types of sequences

    s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

    s/[0-9]+/\#/g;             # map numerical strings to #

    s/\s+/ /g;                 # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;


    print if(m/\S/);           # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline

My Changes:

#!/usr/local/bin/perl

$dirtoget="1999_txt/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD); #
closedir(IMD);
    foreach $f (@thefiles)
    {
        unless ( ($f eq ".") || ($f eq "..") )
        {
            $fr="$dirtoget$f";
            open(FILEREAD, "< $fr");

$x="";
while($line = <FILEREAD>) { $x .= $line; } # read the whole file into one string
close FILEREAD;

print "$x/n";   
while(<$x>) {
    chomp;
    s/\<[^<>]*\>//g;           # eliminate markup
    tr/[A-Z]/[a-z]/;           # downcase

    s/([a-z]+|[^a-z]+)/\1 /g;  # separate letter strings from other types of sequences

    s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

    s/[0-9]+/\#/g;             # map numerical strings to #

    s/\s+/ /g;                 # these three lines clean up white space (so it's always exactly one space between words, no newlines
    s/^\s+//;
    s/\s+$/ /;


    print if(m/\S/);           # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline

}}

Upvotes: 1

Answers (2)

Borodin

Reputation: 126732

Your main problem is that you open each file, read its contents into $x, and then pass $x to the original loop as if it were a file handle. But it's not a file handle: it's just plain text. If you omit the step that reads the file into $x and loop over the file handle directly, your code is close to working.
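If you did want to keep the whole file in a string, Perl can open a file handle over a scalar by passing a reference to it to open. The following is only a minimal sketch of that idea, not part of the fix: the contents of $x are a made-up stand-in for a slurped file, and @lines is just an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the file contents slurped into $x in the question.
my $x = "First <b>line</b>\nSecond line\n";

# Passing a reference to a scalar makes open() create an in-memory file handle,
# so <$fh> then reads the string one line at a time.
open my $fh, '<', \$x or die "Cannot open in-memory handle: $!";

my @lines;
while ( my $line = <$fh> ) {
    chomp $line;
    push @lines, $line;
}
close $fh;

print scalar(@lines), " lines read\n";
```

Reading the file handle directly, as in the code further down, avoids the intermediate string altogether.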

I think this will do as you ask. It uses glob in preference to opendir/readdir because it is more concise, and because glob returns each name with the directory already prepended.

#!/usr/local/bin/perl

use strict;
use warnings;

while ( my $file = glob '1999_txt/*' ) {

    next unless -f $file;

    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};

    while ( <$fh> ) {
        chomp;

        s/<[^<>]*>//g;             # Remove HTML tags
        tr/A-Z/a-z/;               # downcase

        s/([a-z]+|[^a-z]+)/$1 /g;  # separate letter strings from other types of sequences

        s/[^a-z0-9\$\% ]//g;       # delete anything not a letter, digit, $, or %

        s/[0-9]+/#/g;              # map numerical strings to #

        s/\s+/ /g;                 # these three lines clean up whitespace
        s/^\s+//;                  # so it's always exactly one space
        s/\s+$/ /;                 # between words, including across line breaks

        print if /\S/;             # print what's left if it's not just whitespace
    }

    print "\n"; # final newline, so whole doc is on one line that ends in newline
}
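
One way to check what the substitutions do is to pull them into a subroutine and run it on a single sample line. This is a sketch under stated assumptions: tokenize_line and the sample input are hypothetical names introduced here for illustration, and it follows the question's original script in leaving one trailing space so that consecutive lines stay separated when printed.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the same substitutions as the loop body to one line.
sub tokenize_line {
    my ($line) = @_;
    chomp $line;

    $line =~ s/<[^<>]*>//g;               # remove markup tags
    $line =~ tr/A-Z/a-z/;                 # downcase
    $line =~ s/([a-z]+|[^a-z]+)/$1 /g;    # put a space after each letter / non-letter run
    $line =~ s/[^a-z0-9\$\% ]//g;         # keep only letters, digits, $, % and spaces
    $line =~ s/[0-9]+/#/g;                # map numerical strings to #
    $line =~ s/\s+/ /g;                   # collapse runs of whitespace
    $line =~ s/^\s+//;                    # drop leading whitespace
    $line =~ s/\s+$/ /;                   # leave exactly one trailing space

    return $line;
}

# Made-up sample input; single quotes keep $25 from being interpolated.
my $sample = '<p>Hello, World! It costs $25.</p>' . "\n";
print '[', tokenize_line($sample), "]\n";   # prints [hello world it costs $# ]
```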

Upvotes: 1

nowox

Reputation: 29106

You don't really need to edit the original script to apply it to the content of a directory. The shell will be your friend in this case.

Your first script reads every file passed as an argument or, by default, the content of stdin. In other words, you can call your original script like this:

$ ./script file > output
$ cat file | ./script | less
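
To illustrate what the <> operator does with its arguments, here is a small sketch. It assumes nothing about your real data: it writes two throwaway files (a.txt and b.txt are made-up names) into a temporary directory, puts their paths in @ARGV, and lets <> read them in turn, just as it would if they had been given on the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create two small sample files in a temporary directory (removed on exit).
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $out, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
    print {$out} "Hello\nYou\n";
    close $out;
}

# <> reads each file named in @ARGV in order; it falls back to STDIN
# only when @ARGV is empty.
@ARGV = map { "$dir/$_" } qw(a.txt b.txt);

my @seen;
while (<>) {
    chomp;
    push @seen, "$ARGV: $_";    # $ARGV holds the name of the file being read
}

print "$_\n" for @seen;
```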

If you want to parse all the files you can still use your shell:

$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"

It might be clearer with a short example. Consider a script similar to yours, named script:

#!/usr/bin/perl 
while(<>) {
   chomp;
   print ">$_<\n";
}
print "\n";

Now, from your shell you can do:

$ mkdir foo && cd foo
$ echo -e "Hello\nYou\nI am A" >> a.txt
$ echo -e "Hello\nYou\nI am A" >> b.txt

$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"

$ ls 
a.txt  a.txt.out  b.txt  b.txt.out  script  script.out
$ cat a.txt.out
>Hello<
>You<
>I am A<

Upvotes: 1
