Floopy-Doo

Reputation: 130

Why does my Perl script die with an "out of memory" exception?

I need to read a 200 MB space-separated file line by line and collect its contents into an array.

Every time I run the script, Perl throws an "out of memory" exception, but I don't understand why!

Some advice please?

#!/usr/bin/perl -w
use strict;
use warnings;

open my $fh, "<", "../cnai_all.csd";
my @parse = ();

while (<$fh>) {
     my @words = split(/\s/,$_);
     push (@parse, \@words);
}

print scalar @parse;

The cnai file looks like this: it contains 11,000 rows with 4,200 space-separated values per line.

VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 
VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C 

The code above is just a stripped-down sample.
The final script will store all values in a hash and write them to a database later.

But first, I have to solve that memory problem!

Upvotes: 3

Views: 20781

Answers (6)

Floopy-Doo

Reputation: 130

At last, I have found a more suitable solution to my problem:

While researching other parsers I needed to develop, I learned about the DBD::CSV module.

DBD::CSV allows me to select only the columns I need (out of more than 4,000) from the whitespace-separated fields. This reduces memory usage and performs quite well.

More at DBD-CSV @ CPAN.org

Thanks to gbacon, I've changed my strategy from reading the whole file in one go to reading it part by part. DBD::CSV makes this possible without much coding.

#!/usr/bin/perl  -w

use strict;
use warnings;
use DBI;
## -------------------------------------------------------------------------##

## -------------------------------------------------------------------------##
## SET GLOBAL CONFIG #############
my $globalConfig = {
                _DIR => qq{../Data},
                _FILES => {
                    'cnai_all.csd'   => '_TEST'
                    }               
                };
## -------------------------------------------------------------------------##


## -------------------------------------------------------------------------##
my $sTime = time();

my $sepChar = " ";
my $csv_dbh = DBI->connect("DBI:CSV:f_dir=".$globalConfig->{_DIR}.";");

$csv_dbh->{csv_eol} ="\n";
#$csv_dbh->{csv_quote_char} ="'";
#$csv_dbh->{csv_escape_char} ="\\";
$csv_dbh->{csv_null} = 1;
$csv_dbh->{csv_quote_char} = '"';
$csv_dbh->{csv_escape_char} = '"';
$csv_dbh->{csv_sep_char} = "$sepChar";
$csv_dbh->{csv_always_quote} = 0;
$csv_dbh->{csv_quote_space} = 0;
$csv_dbh->{csv_binary} = 0;
$csv_dbh->{csv_keep_meta_info} = 0;
$csv_dbh->{csv_allow_loose_quotes} = 0;
$csv_dbh->{csv_allow_loose_escapes} = 0;
$csv_dbh->{csv_allow_whitespace} = 0;
$csv_dbh->{csv_blank_is_undef} = 0;
$csv_dbh->{csv_empty_is_undef} = 0;
$csv_dbh->{csv_verbatim} = 0;
$csv_dbh->{csv_auto_diag} = 0;


my @list = $csv_dbh->func('list_tables');
my $sth = $csv_dbh->prepare("SELECT CELL,NW,BSC,n_cell_0 FROM cnai_all.tmp");


#print join ("\n",@list);

print "\n-------------------\n";

$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    # just print the hash reference
    print "$row\n";
}
$sth->finish();

print "\n finish after ".(time()-$sTime)." sec ";

On my machine this runs in roughly 20 seconds and uses no more than 10 MB of memory.

Upvotes: 2

Greg Bacon

Reputation: 139521

Your description of the data in cnai_all.csd as having many rows suggests that each line can be processed independently, so there is no need to hold all 46,200,000 values in core at once. If so, use

while (<$fh>) {
  my @words = split /\s/, $_;
  insert_row \@words;
}

where insert_row is a sub you'd define to insert that row into your database.

Note that split /\s/ is often a mistake. The perlfunc documentation on split explains:

As a special case, specifying a PATTERN of space (' ') will split on white space just as split with no arguments does. Thus, split(' ') can be used to emulate awk's default behavior, whereas split(/ /) will give you as many null initial fields as there are leading spaces. A split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field. A split with no arguments really does a split(' ', $_) internally.

In the nominal case, everything is fine:

  DB<1> x split /\s/, "foo bar baz"
0  'foo'
1  'bar'
2  'baz'

But what if there are multiple spaces between fields? Does that mean an empty field or just a "wide" separator?

  DB<2> x split /\s/, "foo  bar baz"
0  'foo'
1  ''
2  'bar'
3  'baz'

What about leading whitespace?

  DB<3> x split /\s/, " foo bar baz"
0  ''
1  'foo'
2  'bar'
3  'baz'

The default behavior of split isn't arbitrary. Let the tool work for you!

while (<$fh>) {
  insert_row [ split ];
}

Upvotes: 4

Sinan Ünür

Reputation: 118138

The database you are using probably has a bulk import function. I would try that first.

If you need to do something with each row before putting it into the database (assuming the operations do not require referencing other rows), you should insert the row into the database as soon as processing is complete (turn AutoCommit off) rather than trying to store all the data in memory.
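
For example, a minimal sketch of that second approach using DBI with AutoCommit turned off (the SQLite DSN, table name, and column names are only placeholders for whatever database you are actually targeting):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder DSN and schema -- substitute your real database here.
my $dbh = DBI->connect("dbi:SQLite:dbname=cnai.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

my $ins = $dbh->prepare("INSERT INTO cnai (cell, nw, bsc) VALUES (?, ?, ?)");

open my $fh, "<", "../cnai_all.csd" or die "open: $!";
while (<$fh>) {
    my @fields = split ' ', $_;
    # Process the row, then push it straight to the database
    # instead of accumulating everything in @parse.
    $ins->execute(@fields[0 .. 2]);
}
close $fh;

$dbh->commit;       # single commit at the end, since AutoCommit is off
$dbh->disconnect;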

If the processing of each row depends on information in other rows, then you can use Tie::File to treat the input file as an array of lines. Again, don't try to store the contents of each line in memory. When processing is complete, ship it off to the database.
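
A rough Tie::File sketch of that last case; the look-back at the previous row is just a stand-in for whatever cross-row processing you actually need:

use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# The file is presented as an array of lines, but lines are read from
# disk on demand, so the whole 200 MB never has to sit in memory.
tie my @lines, 'Tie::File', '../cnai_all.csd', mode => O_RDONLY
    or die "Cannot tie file: $!";

for my $i (0 .. $#lines) {
    my @current  = split ' ', $lines[$i];
    my @previous = $i > 0 ? split(' ', $lines[$i - 1]) : ();
    # ... combine @current and @previous, then ship the result to the database ...
}

untie @lines;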

Upvotes: 0

Jonathan Feinberg

Reputation: 45324

That would be because... you're running out of memory!

You're not merely storing 200MB of data. You're creating a new list data structure for each line, with all of its associated overhead, and also creating a bunch of separate string objects for each word, with all of their associated overhead.

Edit: As an example of the kind of overhead we're talking about here, each and every value (and that includes strings) has the following overhead:

/* start with 2 sv-head building blocks */
#define _SV_HEAD(ptrtype) \
    ptrtype sv_any;     /* pointer to body */   \
    U32     sv_refcnt;  /* how many references to us */ \
    U32     sv_flags    /* what we are */

#define _SV_HEAD_UNION \
    union {             \
    char*   svu_pv;     /* pointer to malloced string */    \
    IV      svu_iv;         \
    UV      svu_uv;         \
    SV*     svu_rv;     /* pointer to another SV */     \
    SV**    svu_array;      \
    HE**    svu_hash;       \
    GP* svu_gp;         \
    }   sv_u


struct STRUCT_SV {      /* struct sv { */
    _SV_HEAD(void*);
    _SV_HEAD_UNION;
};

So that's at least 4 32-bit values per Perl object.
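
You can measure that overhead directly with the CPAN module Devel::Size (a quick sketch; the exact numbers vary with perl version and build):

use strict;
use warnings;
use Devel::Size qw(size total_size);

my $line  = "VALUE_GROUP_A VALUE_GROUP_B VALUE_GROUP_C";
my @words = split ' ', $line;

printf "raw line:       %d bytes\n", length $line;
printf "one word as SV: %d bytes\n", size($words[0]);
printf "array of words: %d bytes\n", total_size(\@words);

The array of short strings typically weighs several times more than the raw text, which is why 200 MB of input can exhaust memory once every word becomes its own scalar inside per-line arrays.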

Upvotes: 6

Arthur Shipkowski

Reputation: 3666

Generally this means you are running out of memory for Perl, but it's possible you aren't running out of system memory. First, the perl debug guts documentation describes ways to get more information on perl's memory usage -- though you may find yourself recompiling perl to use them. (Also note the warning in that doc about perl's hunger for memory...)

However, on many operating systems memory limits can be set per process or per user. If, for example, you're using Linux (or another POSIX system), you might need to alter your ulimits. Type 'ulimit -a' and look at your memory sizes; it's possible your 'max memory size' is below the memory in your machine -- or you have a limited data seg size. You can then reset it with the appropriate option, e.g. ulimit -d 1048576 for a 1 GB data seg size limit.
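
If you want to inspect or raise those limits from inside the script itself, the CPAN module BSD::Resource wraps getrlimit/setrlimit; a rough sketch, assuming the module is installed and your platform supports RLIMIT_DATA:

use strict;
use warnings;
use BSD::Resource qw(getrlimit setrlimit RLIMIT_DATA);

# Current data segment limits (soft, hard), in bytes.
my ($soft, $hard) = getrlimit(RLIMIT_DATA);
print "data seg size: soft=$soft hard=$hard\n";

# Try to raise the soft limit to 1 GB (cannot exceed the hard limit).
setrlimit(RLIMIT_DATA, 1024 * 1024 * 1024, $hard)
    or warn "could not raise RLIMIT_DATA\n";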

Of course, there is another option: process the file line-by-line, if your situation allows it. (The example code above can be rewritten in such a way.)

Upvotes: 5

Antti Huima

Reputation: 25522

Your while loop does not read from the file. You should have

<$fh>

or something inside the parentheses.

Upvotes: 2
