Reputation: 277
I'm a Perl amateur. Recently I was given a Perl script that takes a text file and removes all formatting, leaving only the individual words, each followed by a space. The problem is that it is unclear how to give the script a file location. I've written some code to run it over an entire directory of files, but I haven't been able to get it to execute yet. I'll post the original code followed by what I added. Thanks for the help!
Original:
while(<>) {
    chomp;
    s/\<[^<>]*\>//g; # eliminate markup
    tr/[A-Z]/[a-z]/; # downcase
    s/([a-z]+|[^a-z]+)/\1 /g; # separate letter strings from other types of sequences
    s/[^a-z0-9\$\% ]//g; # delete anything not a letter, digit, $, or %
    s/[0-9]+/\#/g; # map numerical strings to #
    s/\s+/ /g; # these three lines clean up white space (so it's always exactly one space between words, no newlines)
    s/^\s+//;
    s/\s+$/ /;
    print if(m/\S/); # print what's left
}
print "\n"; # final newline, so whole doc is on one line that ends in newline
My Changes:
#!/usr/local/bin/perl
$dirtoget="1999_txt/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD);
closedir(IMD);
foreach $f (@thefiles)
{
    unless ( ($f eq ".") || ($f eq "..") )
    {
        $fr="$dirtoget$f";
        open(FILEREAD, "< $fr");
        $x="";
        while($line = <FILEREAD>) { $x .= $line; } # read the whole file into one string
        close FILEREAD;
        print "$x/n";
        while(<$x>) {
            chomp;
            s/\<[^<>]*\>//g; # eliminate markup
            tr/[A-Z]/[a-z]/; # downcase
            s/([a-z]+|[^a-z]+)/\1 /g; # separate letter strings from other types of sequences
            s/[^a-z0-9\$\% ]//g; # delete anything not a letter, digit, $, or %
            s/[0-9]+/\#/g; # map numerical strings to #
            s/\s+/ /g; # these three lines clean up white space (so it's always exactly one space between words, no newlines)
            s/^\s+//;
            s/\s+$/ /;
            print if(m/\S/); # print what's left
        }
        print "\n"; # final newline, so whole doc is on one line that ends in newline
    }
}
Upvotes: 1
Views: 60
Reputation: 126732
Your main problem is that you are opening each file and reading its contents into $x, and then passing $x as a file handle to the original loop. But it's not a file handle -- it's just plain text. If you just omit the reading of the file then your code is close to working.
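For completeness, if you did want to keep the slurped text in $x, a string can be read line by line by opening a filehandle on a reference to it. This is only a sketch of that idea, not the approach taken in the script below; the sample text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the contents slurped from one file
my $x = "<b>Hello</b> World\nSecond line\n";

# Passing a reference to a scalar makes open() read from the string itself
open my $fh, '<', \$x or die "Cannot open in-memory handle: $!";

my @lines;
while ( my $line = <$fh> ) {
    chomp $line;
    $line =~ s/<[^<>]*>//g;    # same markup removal as in the question
    push @lines, $line;
}
close $fh;

print "$_\n" for @lines;
```

Reading the real file handle directly is still simpler, since it avoids holding each whole file in memory first.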
I think this will do as you ask. It uses glob in preference to opendir/readdir because it is more concise:
#!/usr/local/bin/perl
use strict;
use warnings;
while ( my $file = glob '1999_txt/*' ) {
    next unless -f $file;
    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
    while ( <$fh> ) {
        chomp;
        s/<[^<>]*>//g; # Remove HTML tags
        tr/A-Z/a-z/; # downcase
        s/([a-z]+|[^a-z]+)/$1 /g; # separate letter strings from other types of sequences
        s/[^a-z0-9\$\% ]//g; # delete anything not a letter, digit, $, or %
        s/[0-9]+/#/g; # map numerical strings to #
        s/\s+/ /g; # these three lines clean up whitespace
        s/^\s+//; # so it's always exactly one space
        s/\s+$/ /; # between words, no newlines (the trailing space keeps words on adjacent lines apart)
        print if /\S/; # print what's left if it's not just whitespace
    }
    print "\n"; # final newline, so whole doc is on one line that ends in newline
}
Upvotes: 1
Reputation: 29106
You don't really need to edit the original script to apply it to the contents of a directory. The shell will be your friend in this case.
Your first script will read every file passed as an argument or, by default, the contents of stdin. In other words, you can call your original script like this:
$ ./script file > output
$ cat file | ./script | less
If you want to parse all the files you can still use your shell:
$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"
It might be clearer with this short example:
Consider a script similar to yours, named script:
#!/usr/bin/perl
while(<>) {
    chomp;
    print ">$_<\n";
}
print "\n";
Now, from your shell you can do:
$ mkdir foo && cd foo
$ echo -e "Hello\nYou\nI am A" >> a.txt
$ echo -e "Hello\nYou\nI am A" >> b.txt
$ ls | xargs -n1 -I{} sh -c "./script {} > {}.out"
$ ls
a.txt a.txt.out b.txt b.txt.out script script.out
$ cat a.txt.out
>Hello<
>You<
>I am A<
Upvotes: 1