Reputation: 75
For a bunch of text files (all very small, ~100 lines) in a directory, I need to build a string for each one and then pipe everything into fzf so that the user can select one file. The string itself depends on the first few (~20) lines of the file and is built using a couple of very simple regex patterns. Between successive calls, only a few files are expected to have changed. I'm looking for some way to do this without noticeable delay for the user, for about 50k files.
Here is what I did so far. My first solution was a naive shell script, namely:
cat $dir/**/* | $process_script | fzf
where $process_script is a Perl script which reads each text file line by line until it has built the required string and then prints it. Already with 1000 files to process, this script is no longer usable, as it takes about two seconds and therefore incurs a noticeable delay for the user. So I implemented a poor man's cache by storing the strings in a text file and then updating only those lines which have actually changed (based on the mtime of the files). The new script roughly does:
$find_files_with_mtime_newer_than_last_script_run | $process_script | fzf
where $find_files_with_mtime_newer_than_last_script_run runs fd (a fast find replacement) and $process_script is a Perl script of the form
my $cache = slurp($cachefile);  # read lines of cachefile into one multiline string
my ($string, $id);
while (<>) {
    ($string, $id) = build_string($_);  # open file and build string
    $cache =~ s/^.*$id.*\n//m;          # delete old string from cache
    $cache .= $string;                  # insert updated string into cache
}
print $cache;
spew($cache, $cachefile);               # write cachefile
spew(sprintf('%s', time), $mtimefile);  # store current time
Here, slurp, spew and build_string do what is written in the comments. Right now, this solution is fast enough for the user not to notice any delay, but I suspect that this will change again when the number of files grows.
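For concreteness: build_string is not shown in the question, so the following is a purely hypothetical version. The regex patterns, the returned "id" format, and the field names are invented for illustration only.

# Hypothetical sketch of build_string. The real patterns are not shown in the
# question; the regexes and the returned values here are assumptions.
sub build_string {
    my ($file) = @_;
    chomp $file;                             # filenames arrive one per line from find/fd
    open my $fh, '<', $file or die "Can't open $file: $!";
    my ($title, $tag);
    while (<$fh>) {
        last if $. > 20;                     # only the first ~20 lines matter
        $title //= $1 if /^#\s*(.+)/;        # assumed pattern
        $tag   //= $1 if /^tag:\s*(\S+)/;    # assumed pattern
        last if defined $title and defined $tag;
    }
    # the string that ends up in the cache (and in fzf), plus an id to find it again
    return ("$file: " . ($title // '') . " [" . ($tag // '') . "]\n", $file);
}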
My question: As written above, I'm looking for some way to speed this task up. In particular, could you please comment on whether the following strategy should result in an acceptable (i.e., less than a second) runtime: replace the plain text cache file with an SQLite database (or something similar) which stores the built string together with the corresponding filename and its last processing time; then pass the current time to the script and extract all files which need to be updated directly from SQLite, without using find or fd; and parallelize the processing of those files which need to be updated using GNU Parallel.
Of course, I'd also be very thankful for different solutions.
Upvotes: 4
Views: 467
Reputation: 66964
Note: The first part has an approach using a cache file, the second one an approach with sqlite, and then there is a comparison between the two.
Whether any one solution is going to be "fast enough" for the purpose depends entirely on all those numbers, of course. So does the best approach to take.
For what you show -- tiny files, of which very few change -- the basics should be good enough:
use warnings;
use strict;
use feature 'say';

my $fcache = 'cache.txt';  # format: filename,epoch,processed_string

open my $fh, '<', $fcache or die "Can't open $fcache: $!";
my %cache = map { chomp; my @f = split /,/, $_, 3; shift @f => \@f } <$fh>; #/
close $fh;

for (@ARGV) {
    my $mtime = (stat)[9];
    # Have to process the file (and update its record)
    if ( $cache{$_}->[0] < $mtime ) {
        @{$cache{$_}} = ($mtime, proc_file($_));
    }
    say $cache{$_}->[1];
}

# Update the cache file
open my $fh_out, '>', $fcache or die "Can't open $fcache: $!";
say $fh_out join(',', $_, @{$cache{$_}}) for keys %cache;

sub proc_file {  # token processing: join words with _
    my $content = do { local (@ARGV, $/) = $_[0]; <> };
    return join '_', split ' ', $content;
}
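The printed strings are what the question ultimately pipes into fzf. As a hedged sketch (my addition, not part of this answer), that pipe can also be opened from Perl itself rather than in the shell:

use warnings;
use strict;
use feature 'say';

# Sketch only: write a list of strings straight to an fzf process instead of
# relying on a shell pipeline. Assumes fzf is on PATH; fzf draws its UI on the
# terminal and writes the selected line to this script's standard output.
my @strings = <STDIN>;       # e.g. the cached strings, one per line
chomp @strings;

open my $fzf, '|-', 'fzf' or die "Can't start fzf: $!";
say $fzf $_ for @strings;
close $fzf or warn "fzf exited with status $?\n";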
Notes
This will not preserve the order of records in the cache since a hash is used, which doesn't appear to matter here. If the order is needed, you'd have to record the existing order of lines and sort accordingly before writing.
The exact structure of the "cache" file, and of the data structure used for it in the program, are a little arbitrary; they are samples. Improve them, by all means.
A cache file must already exist for the above to work, in the format given in a comment: filename,seconds-since-epoch,string. Add code to write it if it doesn't exist.
The biggest consumer here is the line that populates the data structure from the 50k-line cache file. That should remain the most time-consuming part as long as the files are small and only a few need processing.
I'd say that involving sqlite would mostly add overhead for such a small problem.
If the number of files to process each time grows beyond a handful, you may want to try processing them in parallel -- given how small they are, the bulk of the time is spent on the overhead of accessing files, and perhaps there's enough "elbow room" there to gain from parallel processing. Also, in general, I/O can certainly be sped up by running in parallel, but that depends entirely on circumstances.
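The question proposes GNU Parallel for this; staying within Perl, a hedged sketch of the same idea (my addition, not part of this answer) could use Parallel::ForkManager, collecting each child's result back in the parent:

use warnings;
use strict;
use feature 'say';
use Parallel::ForkManager;

# Sketch only: process a list of changed files in parallel and collect the
# built strings in the parent. @changed and proc_file() stand in for the
# corresponding pieces of the cache-updating script above.
my @changed = @ARGV;
my %result;

my $pm = Parallel::ForkManager->new(4);   # number of workers: an assumption

$pm->run_on_finish( sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    $result{$ident} = $$data if defined $data;
});

for my $file (@changed) {
    $pm->start($file) and next;           # parent: move on to the next file
    my $string = proc_file($file);        # child: do the actual work
    $pm->finish(0, \$string);             # ship the result back to the parent
}
$pm->wait_all_children;

say "$_ => $result{$_}" for sort keys %result;

sub proc_file {
    return join '_', split ' ', do { local (@ARGV, $/) = $_[0]; <> };
}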
I thought that this was a perfect case to compare with sqlite, as I'm not certain what to expect.
First, I write 50,000 tiny files (each containing a N b) into a separate directory (dir):
perl -wE'for (1..50_000) { open $fh, ">dir/f$_.txt"; say $fh "a $_ b" }'
(always use the three-argument open normally!) This took 3 seconds on my old laptop.
Now we need to build a cache file and a (sqlite) database from these files, then update a handful of them, and then compare processing using programs with sqlite and with a cache file. Here is first the code for the approach using sqlite.
Make and populate the database, in a file files.db
use warnings;
use strict;
use feature 'say';
use DBI;

my ($dir, $db) = ('dir', 'files.db');

my $dbh = DBI->connect("DBI:SQLite:dbname=$db", '', '', { RaiseError => 1 });

my $table = 'files';
my $qry = qq( create table $table (
    fname   text    not null unique,
    mtime   integer not null,
    string  text
); );
my $rv = $dbh->do($qry);

chdir $dir or die "Can't chdir to $dir: $!";
my @fnames = glob "*.txt";

# My sqlite doesn't accept much past 500 rows in a single insert (?)
# The "string" that each file is digested into: join words with _
my $tot_inserted = 0;
while (my @part = splice @fnames, 0, 500) {
    my @vals;
    for my $fname ( @part ) {
        my $str = join '_',
            split ' ', do { local (@ARGV, $/) = $fname; <> };
        push @vals, "('$fname'," . (stat $fname)[9] . ",'$str')";
    }
    my $qry = qq(insert into $table (fname,mtime,string) values )
        . join ',', @vals;
    $tot_inserted += $dbh->do($qry);
}
say "Inserted $tot_inserted rows";
This took around 13 seconds, a one-time expense. I insert 500 rows at a time since my sqlite won't let me do much more; I don't know why that is (I've pushed PostgreSQL to a few million rows in a single insert statement). Having the unique constraint on a column gets it indexed.
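As a hedged aside (not something tested in this answer), the same bulk load can be written with placeholders and a single transaction, which sidesteps both the row-count limit and the hand-quoting of values:

use warnings;
use strict;
use DBI;

# Sketch only: bulk-insert with placeholders inside one transaction,
# mirroring the table and directory layout of the program above.
my ($dir, $db, $table) = ('dir', 'files.db', 'files');
my $dbh = DBI->connect("DBI:SQLite:dbname=$db", '', '', { RaiseError => 1 });

chdir $dir or die "Can't chdir to $dir: $!";

$dbh->begin_work;
my $sth = $dbh->prepare("insert into $table (fname,mtime,string) values (?,?,?)");
for my $fname (glob "*.txt") {
    my $str = join '_', split ' ', do { local (@ARGV, $/) = $fname; <> };
    $sth->execute($fname, (stat $fname)[9], $str);
}
$dbh->commit;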
Now we can change a few timestamps
touch dir/f[1-9]11.txt
and then run a program to update the sqlite database for these changes:
use warnings;
use strict;
use feature 'say';
use DBI;
use Cwd qw();
use Time::HiRes qw(gettimeofday tv_interval);

my $time_beg = [gettimeofday];

my ($dir, $db) = ('dir', 'files.db');
die "No database $db found\n" if not -f $db;

my $dbh = DBI->connect("DBI:SQLite:dbname=$db", '', '', { RaiseError => 1 });

# Get all filenames with their timestamps (seconds since epoch)
my $orig_dir = Cwd::cwd;
chdir $dir or die "Can't chdir to $dir: $!";
my %file_ts = map { $_ => (stat)[9] } glob "*.txt";

# Get all records from the database and extract those with old timestamps
my $table = 'files';
my $qry = qq(select fname,mtime,string from $table);
my $rows = $dbh->selectall_arrayref($qry);
my @new_rows = grep { $_->[1] < $file_ts{$_->[0]} } @$rows;
say "Got ", 0+@$rows, " records, ", 0+@new_rows, " with new timestamps";

# Reprocess the updated files and update the record
foreach my $row (@new_rows) {
    @$row[1,2] = ( $file_ts{$row->[0]}, proc_file($row->[0]) );
}
printf "Runtime so far: %.2f seconds\n", tv_interval($time_beg);  #--> 0.34

my $tot_updated = 0;
$qry = qq(update $table set mtime=?,string=? where fname=?);
my $sth = $dbh->prepare($qry);
foreach my $row (@new_rows) {
    $tot_updated += $sth->execute(@$row[1,2], $row->[0]);
}
say "Updated $tot_updated rows";

$dbh->disconnect;
printf "Runtime: %.2f seconds\n", tv_interval($time_beg);  #--> 1.54

sub proc_file {
    return join '_',
        split ' ', do { local (@ARGV, $/) = $_[0]; <> };
}
This expressly doesn't print the strings. I left that out since there are a few ways to do it and I wasn't sure what exactly needs to be printed. I'd probably run another select for that, after it's all updated.
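A hedged sketch of that final select (my addition), producing the lines that would be handed to fzf:

# Sketch only: after the updates, pull every built string back out in one
# query and print it, one per line, for fzf to consume. Assumes $dbh and
# $table as in the program above (run this before the disconnect).
my $strings = $dbh->selectcol_arrayref("select string from $table");
say for @$strings;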
The program takes, remarkably consistently, around 1.35 seconds, averaged over a few runs. But up to the part where it updates the database for those (few!) changes it takes around 0.35 seconds, and I don't see why the update of a handful of records takes that long in comparison.
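One guess at that (my speculation, not something measured here): with DBI's default AutoCommit, every execute runs as its own SQLite transaction with its own sync to disk, so batching the handful of updates into one explicit transaction might help.

# Sketch only: wrap the per-row updates from the program above in a single
# transaction, so SQLite syncs to disk once instead of once per statement.
# Assumes $dbh, $table and @new_rows as in that program.
my $sth = $dbh->prepare("update $table set mtime=?,string=? where fname=?");

$dbh->begin_work;
my $tot_updated = 0;
$tot_updated += $sth->execute(@$_[1,2], $_->[0]) for @new_rows;
$dbh->commit;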
Next, in order to compare, we need to complete the approach using a cache file from the first part of this post by writing that cache file (which was left out there). The complete program is very slightly different from the one at the beginning:
use warnings;
use strict;
use feature 'say';
use Cwd qw();

my ($dir, $cache) = ('dir', 'cache.txt');

if (not -f $cache) {
    open my $fh, '>', $cache or die "Can't open $cache: $!";
    chdir $dir or die "Can't chdir to $dir: $!";
    my @fnames = glob "*.txt";
    for my $fname (@fnames) {
        say $fh join ',', $fname, (stat $fname)[9],
            join '_', split ' ', do { local (@ARGV, $/) = $fname; <> };
    }
    say "Wrote cache file $cache, exiting.";
    exit;
}

open my $fh, '<', $cache or die "Can't open $cache: $!";
my %fname = map { chomp; my @f = split /,/, $_, 3; shift @f => \@f } <$fh>; #/

my $orig_dir = Cwd::cwd;
chdir $dir or die "Can't chdir to $dir: $!";
my @fnames = glob "*.txt";

for my $f (@fnames) {
    my $mtime = (stat $f)[9];
    # Have to process the file (and update its record)
    if ( $fname{$f}->[0] < $mtime ) {
        @{$fname{$f}} = ($mtime, proc_file($f));
        say "Processed $f, updated with: @{$fname{$f}}";
    }
    #say $fname{$f}->[1];  # 50k files! suppressed for feasible testing
}

# Update the cache
chdir $orig_dir or die "Can't chdir to $orig_dir: $!";
open my $fh_out, '>', $cache or die "Can't open $cache: $!";
say $fh_out join(',', $_, @{$fname{$_}}) for keys %fname;

sub proc_file {
    return join '_',
        split ' ', do { local (@ARGV, $/) = $_[0]; <> };
}
Writing the cache initially takes around 1 second. After a few files are touch-ed, like in the sqlite test, the next run of this program takes, again rather consistently, around 0.45 seconds.
With these tests I have to conclude that the sqlite approach is a bit slower under these conditions. But it is of course far more scalable, and projects only tend to grow in size. Recall also that the update of the database takes quite a while (relatively), which surprises me; there may be something off with my code, and it may be possible to speed that up.
Upvotes: 4
Reputation: 2341
To answer your questions as I would expect this to go:
Replace the plain text cache file with an sqlite file (or something similar), which stores the built string together with the corresponding filename and its last processing time
Yes, this will speed up the process. The overhead of using DBI and DBD::SQLite (and OPENING the file) is less than 10 ms on my machine.
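A hedged way to check that number on your own machine (my sketch, not the answerer's code):

use warnings;
use strict;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# Sketch only: time how long connecting to (and thus opening) the SQLite
# database file takes. 'files.db' is the database built in the other answer.
my $t0  = [gettimeofday];
my $dbh = DBI->connect('DBI:SQLite:dbname=files.db', '', '', { RaiseError => 1 });
printf "Connect took %.1f ms\n", 1000 * tv_interval($t0);
$dbh->disconnect;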
then pass the current time to the script, extract all files which need to be updated directly from sqlite without using find or fd
Yes -- that can be done by a single select on an indexed column.
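As a hedged illustration only (my sketch, reusing the files table from the other answer): one such select over the indexed fname column can fetch the cached rows for a set of candidate filenames in a single query; how those candidates are determined is left to the surrounding script.

use warnings;
use strict;
use DBI;

# Sketch only: a single select over an indexed column. fname is indexed via
# its unique constraint (see the other answer). The candidate filenames are
# assumed to come from elsewhere; here they are just taken from @ARGV.
my @changed = @ARGV;
die "No filenames given\n" unless @changed;

my $dbh = DBI->connect('DBI:SQLite:dbname=files.db', '', '', { RaiseError => 1 });

my $placeholders = join ',', ('?') x @changed;
my $rows = $dbh->selectall_arrayref(
    "select fname, mtime, string from files where fname in ($placeholders)",
    undef, @changed,
);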
and parallelize the processing for those files which need to be updated using gnu parallel.
Here I would doubt it. I would guess the overall limiting factor to be I/O, so parallelizing the process won't help.
The most interesting part here is that the use of SQLite scales: it doesn't matter (for the processing part) whether the cache contains 1,000 or 100,000 files, only whether 10 or 1,000 files changed.
Upvotes: 1