Jon

Reputation: 757

What is the most efficient way to open/act upon all of the files in a directory?

I need to perform my script (a search) on all the files of a directory. Here are the methods which work. I am just asking which is best. (I need file names of form: parsedchpt31_4.txt)

Glob:

my $parse_corpus; #(for all options)
##glob (pattern is relative to the current working directory):
my @files = glob("parsed*.txt");
foreach my $file (@files) {
    open($parse_corpus, '<', "$file") or die $!;
     ... all my code...
}

Readdir with while and conditions:

##readdir:
my $dir = '.';
opendir(DIR, $dir) or die $!;

while (my $file = readdir(DIR)) {
    next unless (-f "$dir/$file"); ##Ensure it's a file
    next unless ($file =~ m/^parsed.*\.txt$/); ##Ensure it's a parsed file
    open($parse_corpus, '<', "$dir/$file") or die "Couldn't open file $file: $!";
     ... all my code...
}

Readdir with foreach and grep:

##readdir+grep:
my $dir = '.';
opendir(DIR, $dir) or die $!;
foreach my $file (grep {/^parsed.*\.txt$/} readdir(DIR)) {
    next unless (-f "$dir/$file"); ##Ensure it's a file
    open($parse_corpus, '<', "$dir/$file") or die "Couldn't open file $file: $!";
    ... all my code...
}

File::Find:

##File::Find
use File::Find; ##core module providing find()
my $dir = "."; ##current directory: could be (include quotes): '/Users/jon/Desktop/...'
my @files;
find(\&open_file, $dir);
sub open_file {
    push @files, $File::Find::name if (/^parsed.*\.txt$/);
}
foreach my $file (@files) {
    open($parse_corpus, '<', "$file") or die $!;
     ...all my code...
} 

Is there another way? Is it good to enclose my entire script in the loops? Is it okay that I don't use closedir? I'm passing this off to others and I'm not sure where their files will be, so I may not be able to use glob.

Thanks a lot, hopefully this is the right place to ask this.

Upvotes: 4

Views: 1054

Answers (3)

Joel Berger

Reputation: 20280

I find that a recursive directory-walking function using the perfect partners opendir/readdir and File::chdir (my favorite CPAN module, great for cross-platform work) lets one easily and clearly manipulate anything in a directory, including subdirectories if desired (if not, omit the recursion).

Example (a simple deep ls):

#!/usr/bin/env perl
use strict;
use warnings;

use File::chdir; #Provides special variable $CWD
# assigning to $CWD sets the working directory
# can be local to a block
# evaluates/stringifies to absolute path
# other great features

walk_dir(shift);

sub do_something {
  print shift . "\n";
}

sub walk_dir {
  my $dir = shift;
  local $CWD = $dir;
  opendir my $dh, $CWD or die "Can't open $CWD: $!"; # lexical opendir, so no closedir needed
  print "In: $CWD\n";

  while (my $entry = readdir $dh) {
    next if ($entry =~ /^\.+$/);
    # other exclusion tests    

    if (-d $entry) {
      walk_dir($entry);
    } elsif (-f $entry) {
      do_something($entry);
    }
  }

}

Upvotes: 1

FMc

Reputation: 42411

The best or most efficient approach depends on your purposes and the larger context. Do you mean best in terms of raw speed, simplicity of the code, or something else? I'm skeptical that memory considerations should drive this choice. How many files are in the directory?

For sheer practicality, the glob approach works fairly well. Before resorting to anything more involved, I'd ask whether there is a problem.

If you're able to use other modules, another approach is to let someone else worry about the grubby details:

use File::Util qw();
my $fu = File::Util->new;
my @files = $fu->list_dir($dir, qw(--with-paths --files-only));

Note that File::Find performs a recursive search descending into all subdirectories. Many times you don't want or need that.
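
(For completeness, a minimal sketch of one way to suppress that recursion with `$File::Find::prune`; the `$start` directory and the pattern are placeholders matching the question:)

```perl
#!/usr/bin/env perl
# Sketch: set $File::Find::prune inside the wanted routine to stop
# File::Find from descending into subdirectories.
use strict;
use warnings;
use File::Find;

my $start = '.';
my @files;
find(sub {
    # Skip (prune) any directory other than the starting one.
    if (-d && $File::Find::name ne $start) {
        $File::Find::prune = 1;
        return;
    }
    push @files, $File::Find::name if /^parsed.*\.txt$/;
}, $start);
```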

I would also add that I dislike your two readdir examples because they commingle different pieces of functionality: (1) getting the file names, and (2) processing individual files. I would keep those jobs separate.

my $dir = '.';
opendir(my $dh, $dir) or die $!; # Use a lexical directory handle.
my @files = 
    grep { -f }
    map  { "$dir/$_" }
    grep { /^parsed.*\.txt$/ }
    readdir($dh);

for my $file (@files){
    ...
}

Upvotes: 4

TLP

Reputation: 67900

I think using a while loop is the safer answer. Why? Because loading all the file names into an array could mean large memory usage, and processing the directory entry by entry avoids that problem.

I prefer readdir to glob, but that's probably more a matter of taste.

If performance is an issue, one could argue that the -f check is unnecessary for any name ending in the .txt extension.
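
A minimal sketch of that streaming approach, with the directory and pattern as placeholders matching the question:

```perl
#!/usr/bin/env perl
# Each name is tested and processed as it is read, so the full
# directory listing never sits in memory at once.
use strict;
use warnings;

my $dir = '.';
opendir(my $dh, $dir) or die "Can't open $dir: $!";
while (my $file = readdir $dh) {
    next unless $file =~ /^parsed.*\.txt$/;  # extension check stands in for -f
    open(my $fh, '<', "$dir/$file") or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        # ... search each line here ...
    }
    close $fh;
}
closedir $dh;
```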

Upvotes: 1
