Reputation: 359
[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]
Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.
I've got a script that produces frequency lists. It essentially does the following:
$frequency \t $item
. Any given $item
may occur multiple times, with different values for $frequency
.$item
.$item
s, regardless of their case, and merges these entries into one.The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).
The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.
So I need to somehow optimize the script. And that's what I'm here asking for help with.
Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.
I'd be enormously grateful if your comments and suggestions contained enough code to actually allow me to more or less drop it in to the existing script, as I'm not a programmer by trade, as I said above, and even something so apparently simple as piping the text being processed through some module or another would throw me for a serious curve.
Thanks in advance!
(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.
#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);
# DEFINE VARIABLES
my $delimiter = "\t";
my $split_char = "\t";
my $input_file_name = "";
my $output_file_name = "";
my $in_basename = "";
my $frequency = 0;
my $item = "";
# READ COMMAND LINE OPTIONS
GetOptions (
"input|i=s" => \$input_file_name,
"output|o=s" => \$output_file_name,
);
# INSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
die
"\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}
# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {
# STRIP EXTENSION FROM INPUT FILE NAME
$in_basename = $input_file_name;
$in_basename =~ s/(.+)\.(.+)/$1/;
# GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
$output_file_name = "$in_basename.output.txt";
}
# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
or die "\nERROR: Can't open input file ($input_file_name): $!";
# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";
# PROCESS INPUT FILE LINE BY LINE
my %F;
while (<INPUTFILE>) {
chomp;
# PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
( $frequency, $item ) = split( /$split_char/, $_, 2 );
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;
# PROCESS INPUT LINES
$F{ lc($item) } += $frequency;
}
close INPUTFILE;
# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
|| die "\nERROR: The output file \($output_file_name\) couldn't be opened for writing!\n";
# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}
close OUTPUTFILE;
exit;
Below is some sample input from the source file. It's tab-separated, and the first column is $frequency
, while all the rest together is $item
.
2 útil volver a valdivia
8 útil volver la vista
1 útil válvula de escape
1 útil vía de escape
2 útil vía fax y
1 útil y a cabalidad
43 útil y a el
17 útil y a la
1 útil y a los
21 útil y a quien
1 útil y a raíz
2 útil y a uno
Upvotes: 1
Views: 216
Reputation: 66873
UPDATE In my tests a hash takes 2.5 times the memory that its data alone takes.† However, the program size for me is consistently 3-4 times as large as its variables. This turns a 6.3Gb
data file into a ~15Gb
hash, for a ~60Gb
program, just as reported in comments.
So 6.3Gb == 60Gb
, so to say. This still improved the starting situation enough so to work for the current problem but is clearly not a solution. See the (updated) Another approach below for a way to run this processing without loading the whole hash.
There is nothing obvious to lead to an order-of-magnitude memory blowup. However, small errors and inefficiences can add up so let's first clean up. See other approaches at end.
Here is a simple re-write of the core of the program, to try first.
# ... set filenames, variables
open my $fh_in, '<:encoding(utf8)', $input_file_name
or die "\nERROR: Can't open input file ($input_file_name): $!";
my %F;
while (<$fh_in>) {
chomp;
s/^\s*//; #/trim leading space
my ($frequency, $item) = split /$split_char/, $_, 2;
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq ''
or not defined $item or $item =~ /^\s*$/;
# ... increment counters and aggregates and add to hash
# (... any other processing?)
$F{ lc($item) } += $frequency;
}
close $fh_in;
# Sort and print to file
# (Or better write: "value key-length key" and sort later. See comments)
open my $fh_out, '>:encoding(utf8)', $output_file_name
or die "\nERROR: Can't open output file ($output_file_name\: $!";
foreach my $item ( sort {
$F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b
} keys %F )
{
print $fh_out $F{$item}, "\t", $item, "\n";
}
close $fh_out;
A few comments, let me know if more is needed.
Always add $!
to error-related prints, to see the actual error. See perlvar.
Use lexical filehandles (my $fh
rather than IN
), it's better.
If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ...
(but it doesn't hurt either).
The sort here has to at least copy its input, and with multiple conditions more memory is needed.
That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.
Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.
Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.
But even this pushes the limits, due to the greatly increased memory footprint as compared to data, since hash takes 2.5 times the data size and the whole program is yet 3-4 as large.
Then find an algorithm to write the data line-by-line to the output file. That is simple to do here since by the shown processing we only need to accumulate frequencies for each item
open my $fh_out, '>:encoding(utf8)', $output_file_name
or die "\nERROR: Can't open output file ($output_file_name\: $!";
my $cumulative_freq;
while (<$fh_in>) {
chomp;
s/^\s*//; #/ leading only
my ($frequency, $item) = split /$split_char/, $_, 2;
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq ''
or not defined $item or $item =~ /^\s*$/;
$cumulative_freq += $frequency; # would-be hash value
# Add a sort criterion, $item's length, helpful for later sorting
say $fh_out $cumulative_freq, "\t", length $item, "\t", lc($item);
#say $fh_out $cumulative_freq, "\t", lc($item);
}
close $fh_out;
Now we can use the system's sort
, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key
, run in a terminal
sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3- > result
The command sorts numerically by the first and then by the second field (then it sorts by third itself) and reverses the order. This is piped into cut
which pulls out the first and third fields from STDIN
(with tab as default delimiter), what is the needed result.
A systemic solution is to use a database, and a very convenient one is DBD::SQLite.
I used Devel::Size to see memory used by variables.
† A minimal "scalar" structure, built internally to store a value, takes (on my box and perl) 28 bytes. Then each key and value need a scalar... (Each array and hash structure -- even when anonymous -- take a couple hundred bytes as well, but there's likely way fewer of those in a complex data structure.)
Upvotes: 3
Reputation: 8591
Sorting input requires keeping all input in memory, so you can't do everything in a single process.
However, sorting can be factored: you can easily sort your input into sortable buckets, then process the buckets, and produce the correct output by combining the outputs in reversed-sorted bucket order. The frequency counting can be done per bucket as well.
So just keep the program you have, but add something around it:
Your maximum memory consumption will be slightly more than what your original program takes on the biggest bucket. So if your partitioning is well chosen, you can arbitrarily drive it down.
You can store the input buckets and per-bucket outputs to disk, but you can even connect the steps directly with pipes (creating a subprocess for each bucket processor) - this will create a lot of concurrent processes, so the OS will be paging like crazy, but if you're careful, it won't need to write to disk.
A drawback of this way of partitioning is that your buckets may end up being very uneven in size. An alternative is to use a partitioning scheme that is guaranteed to distribute the input equally (e.g. by putting every nth line of input into the nth bucket) but that makes combining the outputs more complex.
Upvotes: 0