Need serious help optimizing script for memory use

Question

[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]

Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.

I've got a script that produces frequency lists. It essentially does the following:

Reads in lines from a file having the format $frequency \t $item. Any given $item may occur multiple times, with different values for $frequency.
Eliminates certain lines depending on the content of $item.
Sums the frequencies of all identical $items, regardless of their case, and merges these entries into one.
Performs a reverse natural sort on the resulting array.
Prints the results to an output file.

The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).

The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.

So I need to somehow optimize the script. And that's what I'm here asking for help with.

Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.

I'd be enormously grateful if your comments and suggestions contained enough code to actually allow me to more or less drop it in to the existing script, as I'm not a programmer by trade, as I said above, and even something so apparently simple as piping the text being processed through some module or another would throw me for a serious curve.

Thanks in advance!

(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.

#!/usr/bin/env perl

use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);

# DEFINE VARIABLES
my $delimiter            = "\t";
my $split_char           = "\t";

my $input_file_name  = "";
my $output_file_name = "";
my $in_basename      = "";
my $frequency        = 0;
my $item             = "";

# READ COMMAND LINE OPTIONS
GetOptions (
             "input|i=s"         => \$input_file_name,
             "output|o=s"        => \$output_file_name,
           );

# INSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
    die
      "\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}

# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {

    # STRIP EXTENSION FROM INPUT FILE NAME
    $in_basename = $input_file_name;
    $in_basename =~ s/(.+)\.(.+)/$1/;

    # GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
    $output_file_name = "$in_basename.output.txt";
}

# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
    or die "\nERROR: Can't open input file ($input_file_name): $!";

# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";

# PROCESS INPUT FILE LINE BY LINE
my %F;

while () {

    chomp;

    # PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
    ( $frequency, $item ) = split( /$split_char/, $_, 2 );

    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;

    # PROCESS INPUT LINES
    $F{ lc($item) } += $frequency;
}
close INPUTFILE;

# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
    || die "\nERROR: The output file $$output_file_name$ couldn't be opened for writing!\n";

# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
    print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}

close OUTPUTFILE;

exit;

Below is some sample input from the source file. It's tab-separated, and the first column is $frequency, while all the rest together is $item.

2   útil    volver  a   valdivia
8   útil    volver  la  vista
1   útil    válvula de  escape
1   útil    vía de  escape
2   útil    vía fax y
1   útil    y   a   cabalidad
43  útil    y   a   el
17  útil    y   a   la
1   útil    y   a   los
21  útil    y   a   quien
1   útil    y   a   raíz
2   útil    y   a   uno

zdim · Accepted Answer

UPDATE In my tests a hash takes 2.5 times the memory that its data alone takes.^† However, the program size for me is consistently 3-4 times as large as its variables. This turns a 6.3Gb data file into a ~15Gb hash, for a ~60Gb program, just as reported in comments.

So 6.3Gb == 60Gb, so to say. This still improved the starting situation enough so to work for the current problem but is clearly not a solution. See the (updated) Another approach below for a way to run this processing without loading the whole hash.

There is nothing obvious to lead to an order-of-magnitude memory blowup. However, small errors and inefficiences can add up so let's first clean up. See other approaches at end.

Here is a simple re-write of the core of the program, to try first.

# ... set filenames, variables 
open my $fh_in, '<:encoding(utf8)', $input_file_name
    or die "
ERROR: Can't open input file ($input_file_name): $!";

my %F;    
while (<$fh_in>) {    
    chomp;
    s/^\s*//;                                              #/trim  leading space
    my ($frequency, $item) = split /$split_char/, $_, 2;

    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' 
         or not defined $item      or $item =~ /^\s*$/;

    # ... increment counters and aggregates and add to hash
    # (... any other processing?)
    $F{ lc($item) } += $frequency;
}
close $fh_in;

# Sort and print to file
# (Or better write: "value key-length key" and sort later. See comments)
open my $fh_out, '>:encoding(utf8)', $output_file_name 
    or die "
ERROR: Can't open output file ($output_file_name\: $!";

foreach my $item ( sort { 
        $F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b 
    } keys %F )
{
    print $fh_out $F{$item}, "	", $item, "
";
}
close $fh_out;

A few comments, let me know if more is needed.

Always add $! to error-related prints, to see the actual error. See perlvar.
Use lexical filehandles (my $fh rather than IN), it's better.
If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ... (but it doesn't hurt either).

The sort here has to at least copy its input, and with multiple conditions more memory is needed.

That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.

Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.

Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.

But even this pushes the limits, due to the greatly increased memory footprint as compared to data, since hash takes 2.5 times the data size and the whole program is yet 3-4 as large.

Then find an algorithm to write the data line-by-line to the output file. That is simple to do here since by the shown processing we only need to accumulate frequencies for each item

open my $fh_out, '>:encoding(utf8)', $output_file_name 
    or die "
ERROR: Can't open output file ($output_file_name\: $!";

my $cumulative_freq;

while (<$fh_in>) {
    chomp;
    s/^\s*//;  #/ leading only
    my ($frequency, $item) = split /$split_char/, $_, 2;

    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' 
         or not defined $item      or $item =~ /^\s*$/;

    $cumulative_freq += $frequency;  # would-be hash value

    # Add a sort criterion, $item's length, helpful for later sorting
    say $fh_out $cumulative_freq, "	", length $item, "	", lc($item);

    #say $fh_out $cumulative_freq, "	", lc($item);
}
close $fh_out;

Now we can use the system's sort, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key, run in a terminal

sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3-  > result

The command sorts numerically by the first and then by the second field (then it sorts by third itself) and reverses the order. This is piped into cut which pulls out the first and third fields from STDIN (with tab as default delimiter), what is the needed result.

A systemic solution is to use a database, and a very convenient one is DBD::SQLite.

I used Devel::Size to see memory used by variables.

^† A minimal "scalar" structure, built internally to store a value, takes (on my box and perl) 28 bytes. Then each key and value need a scalar... (Each array and hash structure -- even when anonymous -- take a couple hundred bytes as well, but there's likely way fewer of those in a complex data structure.)

Need serious help optimizing script for memory use

Answers (2)

Related Questions