Jon

Reputation: 757

More elegant solution to remove items from a batch of files?

Okay, this is more for my own learning than actual need.

I have files with the following format:

Loading parser from serialized file ./englishPCFG.ser.gz ... done [2.8 sec].
Parsing file: chpt1_1.txt
Parsing [sent. 1 len. 42]: [1.1, Organisms, Have, Changed, over, Billions, of, Years, 1, Long, before, the, mechanisms, of, biological, evolution, were, understood, ,, some, people, realized, that, organisms, had, changed, over, time, and, that, living, organisms, had, evolved, from, organisms, no, longer, alive, on, Earth, .]
(ROOT
  (S
    (S
      (NP (CD 1.1) (NNS Organisms))
      (VP (VBP Have)
        (VP (VBN Changed)
          (PP (IN over)
            (NP
              (NP (NNS Billions))
              (PP (IN of)
                (NP (NNP Years) (CD 1)))))
          (SBAR
            (ADVP (RB Long))
            (IN before)
            (S
              (NP
                (NP (DT the) (NNS mechanisms))
                (PP (IN of)
                  (NP (JJ biological) (NN evolution))))
              (VP (VBD were)
                (VP (VBN understood))))))))
    (, ,)
    (NP (DT some) (NNS people))
    (VP (VBD realized)
      (SBAR
        (SBAR (IN that)
          (S
            (NP (NNS organisms))
            (VP (VBD had)
              (VP (VBN changed)
                (PP (IN over)
                  (NP (NN time)))))))
        (CC and)
        (SBAR (IN that)
          (S
            (NP (NN living) (NNS organisms))
            (VP (VBD had)
              (VP (VBN evolved)
                (PP (IN from)
                  (NP
                    (NP (NNS organisms))
                    (ADJP
                      (ADVP (RB no) (RBR longer))
                      (JJ alive))))
                (PP (IN on)
                  (NP (NNP Earth)))))))))
    (. .)))

num(Organisms-2, 1.1-1)
nsubj(Changed-4, Organisms-2)
aux(Changed-4, Have-3)
ccomp(realized-22, Changed-4)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
num(Years-8, 1-9)
advmod(understood-18, Long-10)
dep(understood-18, before-11)
det(mechanisms-13, the-12)
nsubjpass(understood-18, mechanisms-13)
amod(evolution-16, biological-15)
prep_of(mechanisms-13, evolution-16)
auxpass(understood-18, were-17)
ccomp(Changed-4, understood-18)
det(people-21, some-20)

I need to remove all the dependencies (the last section) that aren't important, and then save the result as a new file. Here is my working code:

#!/usr/bin/perl
use strict;
use warnings;

##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob("parsed"."*.txt");

foreach my $file (@files) {
    my @newfile;
    open(my $parse_corpus, '<', "$file") or die $!;
    while (my $sentences = <$parse_corpus>) {
        #print $sentences, "\n\n";
        if ($sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/) {
            if ($sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/) {
                push (@newfile, $sentences);
            }

        }
        else {
            push (@newfile, $sentences);
        }
    }
    open(FILE, '>', "select$file") or die $!;
    print FILE @newfile;
    close FILE;
}

And a portion of the changed output file:

nsubj(Changed-4, Organisms-2)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
nsubjpass(understood-18, mechanisms-13)
prep_of(mechanisms-13, evolution-16)
nsubj(realized-22, people-21)
nsubj(changed-26, organisms-24)
prep_over(changed-26, time-28)
nsubj(evolved-34, organisms-32)
conj_and(changed-26, evolved-34)
prep_from(evolved-34, organisms-36)
prep_on(evolved-34, Earth-41)

Is there a significantly better way, or a more elegant/clever solution?

Thanks for your time. Again, this is purely for interest, so don't help if you don't have the time.

Upvotes: 0

Views: 65

Answers (1)

DavidO

Reputation: 13942

If I understood your logic, you want to default to printing to the outfile unless you come across a 'sentence' that meets a condition. If you meet that first condition, you only want to output to the outfile if a second condition is also true. In that sort of situation I tend to prefer "if this, next unless that" logic, but that's just me. ;) Here's an example with your code.

use strict;
use warnings;
use autodie;

##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob( "parsed" . "*.txt" );

foreach my $file ( @files ) {
    open my $parse_corpus, '<', "$file";
    open my $outfile, '>', "select$file";
    while ( my $sentences = <$parse_corpus> ) {
        if( $sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/ ) {
            next unless $sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/;
        }
        print $outfile $sentences;
    }
}

I made no attempt to refactor your regular expressions. I did find it more pleasing to my sense of efficiency to write the output file line by line while reading the input file. This eliminates the separate output step after the read loop, as well as the need for an output array.

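Also, I used the autodie pragma instead of specifying 'or die' after each I/O operation. And since I used a lexical filehandle for the output file, it closes itself when it goes out of scope at the end of each loop iteration. One caveat: autodie only checks explicit calls, so that implicit close is not 'or die' enabled. If you want a failed close to be fatal, add an explicit close $outfile; after the while loop.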
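
As for the regular expressions, here is one possible tidy-up, offered as a sketch rather than a tested replacement. It captures the relation name once and checks it against a list of wanted keywords, so adding or removing a relation means editing a list instead of a long alternation. The helper name keep_line and the variables @keep, $keep_re and $dep_line are my own inventions, and the ^ anchor assumes that dependency lines always start at the beginning of the line, as they do in your sample.

```perl
use strict;
use warnings;

# Keywords to keep, taken from the alternation in the question.
# A relation is kept if its name contains any of these.
my @keep    = qw(subj obj prep xcomp agent purpcl conj_and);
my $keep_re = join '|', map { quotemeta } @keep;

# A dependency line looks like: relation(word-N, word-N)
# Assumption: such lines start at column 0, hence the ^ anchor.
my $dep_line = qr/^(\w+)\(\S+-\d+,\s\S+-\d+\)/;

# Hypothetical helper: returns 1 if the line belongs in the output file.
sub keep_line {
    my ($line) = @_;

    # Anything that is not a dependency line is always kept.
    my ($relation) = $line =~ $dep_line
        or return 1;

    # Dependency lines are kept only for the wanted relations.
    return $relation =~ /$keep_re/ ? 1 : 0;
}
```

With that in place, the body of the while loop could shrink to print $outfile $sentences if keep_line($sentences);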

Upvotes: 3
