Grene

Reputation: 25

Perl: Fastest way to find files older than X number of minutes, sorted oldest to newest?

=================

1. Find the files older than X number of minutes

2. Process them from oldest to newest

The code below works fine; however, the directory contains 3 million files, so I need to optimize it to find the files faster. I don't need to worry about the contents of the files, just the names.

###########################
sub get_files_to_process{
###########################
   # Declare arrays
   my @xmlfiles;
   my @qulfiedfiles;

   # Declare a Dictionary
   my %filedisc;

   opendir(my $dh, $maindir) or die "opendir($maindir): $!";

   # Read all the files
   while (my $de = readdir($dh)) {
      # Get the full path of the file
      my $f = $maindir . $de;
      # If it is a plain file and has a .xml extension
      if ( -f $f && $f =~ /\.xml$/ ){
         # Put it in the XMLFILES array
         push (@xmlfiles, $f);
      }
   }
   closedir($dh);

   # For every file in the directory
   for my $file (@xmlfiles) {

      # Get stats about the file
      my @stats = stat($file);

      # If the time stamp is older than the minutes provided
      if ($stats[9] <= ($now - ($minutesold * 60))){

         # Put the file and its time stamp in the dictionary
         $filedisc{$file} = $stats[9];
      }
   }

   # Sort the dictionary by time stamp, oldest files first
   for my $x (sort { $filedisc{$a} <=> $filedisc{$b} or $a cmp $b } keys %filedisc) {

      # Put the qualified files (based on age) in a list
      push(@qulfiedfiles, $x);
   }

   return @qulfiedfiles;
}

UPDATE: So far this seems promising; more testing to be done:

##########################
sub get_files_count{
##########################

   my $cmd= "find $maindir -maxdepth 1 -name '*.xml' -mmin +$minutesold -printf \"%T+\t%p\\n\"| sort";
   my @output = `$cmd`;

   if (@output){
      foreach my $line (@output){
            chomp $line;
            push (@files2process, ( split '\t', $line )[ -1 ]);
         }
      }
   }

Upvotes: 2

Views: 4105

Answers (3)

t-al

Reputation: 1

I know that this is an old question. I am mostly answering it for the "future generations".

Most of your time is very likely spent sorting the 3 million file entries, because sorting is non-linear (i.e. it gets slower and slower the more files you have), and also because most of the stat calls happen in the comparisons, most of which are triggered by the sort. (The file list itself will also occupy quite a chunk of your memory.)

So if you can avoid sorting, you will also automatically avoid most of the stat calls and save a ton of time. Since your task is just "moving the files into appropriate directories", I would simply call the processing method on each file that fits your criteria the moment you find it, rather than first building an enormous list, spending a bunch of cycles sorting it, and then walking through that enormous list to do processing that doesn't necessarily need sorting in the first place.

An example from your own script: "find", unlike say "ls", does not build a file list in memory -- it executes its commands on each file the moment it finds it, which is why it doesn't blow up on enormous directories the way "ls" does. Just do it the way find does ^^
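
To make that concrete, here is a minimal sketch of the "process as you go" idea, reusing $maindir and $minutesold from the question; process_file() is just a hypothetical stand-in for whatever per-file work (e.g. the move) needs to happen:

use strict;
use warnings;

# Placeholders standing in for the question's configuration:
my $maindir    = '/path/to/dir/';   # trailing slash, as in the question
my $minutesold = 30;
my $cutoff     = time() - $minutesold * 60;

# Hypothetical stand-in for the real per-file processing (e.g. the move).
sub process_file {
    my ($path) = @_;
    print "processing $path\n";
}

opendir(my $dh, $maindir) or die "opendir($maindir): $!";
while (my $de = readdir($dh)) {
    next unless $de =~ /\.xml$/;      # only .xml files
    my $f = $maindir . $de;
    next unless -f $f;                # plain files only
    my $mtime = (stat(_))[9];         # reuse the stat already done by -f
    process_file($f) if $mtime <= $cutoff;   # handle it immediately: no big list, no sort
}
closedir($dh);

Each file is stat'ed once, nothing accumulates in memory, and the work starts as soon as the first qualifying file turns up.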

Upvotes: 0

simone

Reputation: 5221

Use File::Find

use File::Find;

$\ = "\n";

my @files;

# find all files newer than 9 minutes
File::Find::find({wanted => \&wanted}, '.');

# sort them and print them
print for map { $_->[0] }  sort { $b->[1] <=> $a->[1] } @files;

exit;

sub wanted {
   ((-M) < (9 / (24 * 60))) && -f && push @files, [ $_, ( -M ) ];
}

This is recursive - so it will go through all sub-directories (but I assume from your question there are none).

Also, the above is mostly auto-generated code from find2perl, which translates most Unix find parameters into a Perl script - cool and fast.

I haven't tested the -M bit with the 9 minutes - I haven't saved anything in the last 9 minutes.
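
One caveat: the wanted above keeps files modified within the last 9 minutes, whereas the question asks for files older than X minutes. An untested variation, assuming $minutesold holds the threshold in minutes, just flips the comparison:

sub wanted {
   # keep plain files whose age (-M, in fractional days) exceeds the threshold
   ((-M) > ($minutesold / (24 * 60))) && -f && push @files, [ $_, ( -M ) ];
}

The existing sort ($b->[1] <=> $a->[1]) then prints the largest -M values, i.e. the oldest files, first, which matches the oldest-to-newest order the question asks for.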

Upvotes: 2

clt60

Reputation: 63892

I would solve this in two steps:

1) Create a Linux::Inotify2 process that, on every change to the directory, updates a cache file (e.g. a Storable file or similar)

That way you always have an up-to-date cache of all the file stats. Loading one Storable file is much faster than gathering stats for 3M files on every run - see the sketch below.

2) When you need to search, just load the Storable file and search one big hash...
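
A minimal sketch of the watcher half of that idea, assuming the Linux::Inotify2 and Storable modules; $maindir and the cache path are placeholders, and a real version would batch the cache writes instead of storing after every single event:

use strict;
use warnings;
use Linux::Inotify2;
use Storable qw(store retrieve);

my $maindir   = '/path/to/dir';          # placeholder: the directory from the question
my $cachefile = '/path/to/mtimes.stor';  # placeholder: where the cache lives

# Load the existing cache (full path => mtime), or start empty.
my $mtimes = -e $cachefile ? retrieve($cachefile) : {};

my $inotify = Linux::Inotify2->new
    or die "unable to create inotify object: $!";

# Keep the cache in sync whenever a file appears, changes or goes away.
$inotify->watch($maindir,
    IN_CLOSE_WRITE | IN_MOVED_TO | IN_MOVED_FROM | IN_DELETE,
    sub {
        my $e    = shift;
        my $path = $e->fullname;
        if ($e->IN_DELETE || $e->IN_MOVED_FROM) {
            delete $mtimes->{$path};
        }
        else {
            $mtimes->{$path} = (stat($path))[9];
        }
        store($mtimes, $cachefile);   # in real use, batch/throttle these writes
    }
);

$inotify->poll while 1;               # block and dispatch events forever

The search side then just does retrieve($cachefile), greps the hash for mtimes older than the cutoff, and sorts only that (much smaller) subset.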

Upvotes: 1
