cooldood3490

Reputation: 2498

Perl script execution keeps getting killed - running out of memory

I am trying to execute a Perl script that processes a small 12 x 2 text file (approx. 260 bytes) and a large .bedgraph file (at least 1.3 MB in size). From these two files, the script outputs a new bedgraph file.

I have run this script on 3 other .bedgraph files, but when I try to run it on the rest of them, the process keeps getting Killed.

On average, the Perl script takes about 20 minutes to run on each .bedgraph file.

I'm running the Perl script on my local machine (not on a server): a 64-bit Ubuntu 12.04 Linux system with 4 GB of RAM.

Why does my Perl script keep getting killed, and how can I fix this?

Here's the script:

# input file handle
open(my $sizes_fh, '<', 'S_lycopersicum_chromosomes.size') or die $!;

# output file handle
open(my $output, '+>', 'tendaysafterbreaker_output.bedgraph') or die $!;

my @array;

while(<$sizes_fh>){
    chomp;
    my ($chrom1, $size) = split(/\t/, $_);
    @array = (0) x $size;

    open(my $bedgraph_fh, '<', 'Solanum_lycopersicum_tendaysafterbreaker.bedgraph') or die $!;
    while(<$bedgraph_fh>){
        chomp;
        my ($chrom2, $start, $end, $FPKM) = split(/\t/, $_);

        if ($chrom1 eq $chrom2){
            for(my $i = $start; $i < $end; $i++){
                $array[$i] += $FPKM;
            }
        }
    }

    close $bedgraph_fh or warn $!;

    my ($last_start, $last_end) = (0, 0);
    my $last_value = $array[0];

    for (my $i = 1; $i < $#array; $i++){
        my $curr_val = $array[$i];
        my $curr_pos = $i;

        # if the current value is not equal to the last value
        if ($curr_val != $last_value){
            $last_value = $curr_val; # 'my' here would create a new variable shadowing the outer $last_value
            print $output "$chrom1\t$last_start\t$last_end\t$last_value\n";
            $last_start = $last_end = $curr_pos;
        } else {
            $last_end = $i;
        }
    }
}

close $sizes_fh or warn $!;

Upvotes: 2

Views: 5588

Answers (3)

user1919238

Reputation:

You are trying to allocate an array of 90,000,000 elements. Perl, due to its flexible typing and other advanced variable features, uses a lot more memory for this than you would expect.

On my (Windows 7) machine, a program that just allocates such an array and does nothing else eats up 3.5 GB of RAM.
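
Something like this trivial program is enough to reproduce the problem; watch the process's memory in top while it sleeps:

#!/usr/bin/perl
# demonstration only: allocate an array of 90 million zeroes, then
# pause so the memory footprint can be inspected (e.g. in top)
my @array = (0) x 90_000_000;
sleep 60;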

There are various ways to avoid this huge memory usage. Here are a couple:

One is the PDL module for scientific data processing, which is designed to store huge numeric arrays in memory efficiently. This will change the syntax for allocating and using the array, though (and it messes around with Perl's syntax in various other ways).
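
As a rough sketch (untested, reusing $size, $chrom1, and $bedgraph_fh from your script, and assuming PDL's default double type is acceptable for your FPKM values), the accumulation loop might become:

use PDL;

# one packed C array of doubles instead of 90 million Perl scalars
my $coverage = zeroes($size);

while (<$bedgraph_fh>) {
    chomp;
    my ($chrom2, $start, $end, $FPKM) = split /\t/;
    next unless $chrom1 eq $chrom2;

    # a slice is a view into $coverage, so += updates it in place
    my $slice = $coverage->slice($start . ':' . ($end - 1));
    $slice += $FPKM;
}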

Another is DBM::Deep, a module that stores a database in a file and then lets you access that database through a normal array or hash:

use DBM::Deep;
my @array;
my $db = tie @array, "DBM::Deep", "array.db";

# Now you can use @array like a normal array, but it will be stored in a database.

Upvotes: 5

amon

Reputation: 57630

If you know a bit of C, it is quite simple to offload the array manipulation into low-level code. A C array takes less space and is a lot faster. However, you lose nice stuff like bounds checking. Here is an implementation with Inline::C:

use Inline 'C';
...;
__END__
__C__
// note: I don't know if your data contains only ints or doubles. Adjust types as needed
int array_len = -1; // last index
int *array = NULL;

void make_array(int size) {
  free(array);
  // calloc zero-initializes the array, matching Perl's (0) x $size
  // if this fails, start checking the return value of calloc for != NULL
  array = (int*) calloc(size, sizeof(int));
  array_len = size - 1;
}

// returns false on bounds error
int array_increment(int start, int end, int fpkm) {
  if ((end - 1) > array_len) return 0;
  int i;
  for (i = start; i < end; i++) {
    array[i] += fpkm;
  }
  return 1;
}

// please check if this is actually equivalent to your code.
// I removed some unnecessary-looking variables.
void loop_over_array(char* chrom1) {
  int
    i,
    last_start = 0,
    last_end   = 0,
    last_value = array[0];
  for(i = 1; i < array_len; i++) { // are you sure not `i <= array_len`?
    if (array[i] != last_value) {
      last_value = array[i];
      // I don't know how to use Perl filehandles from C,
      // so just redirect the output on the command line
      printf("%s\t%d\t%d\t%d\n", chrom1, last_start, last_end, last_value);
      last_start = i;
    }
    last_end = i;
  }
}

void free_array() {
  free(array);
}

Minimal testing code:

use Test::More;

make_array(15);
ok !array_increment(0, 16, 2);
make_array(95_000_000);
ok array_increment(0, 3, 1);
ok array_increment(2, 95_000_000, 1);
loop_over_array("chrom");
free_array();
done_testing;

The output of this test case is

chrom   0       1       2
chrom   2       2       1

(with testing output removed). It may take a second to compile, but after that it should be quite fast.

Upvotes: 2

Dave Sherohman

Reputation: 46187

In the records read from $bedgraph_fh, what's a typical value for $start? Although hashes have more overhead per entry than arrays, you may be able to save some memory if @array starts with a lot of unused entries. For example, if you have an @array of 90 million elements but the first 80 million are never used, then there's a good chance you'll be better off with a hash.
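
As a rough sketch of that idea (reusing the variable names from your script), the accumulation step becomes:

# positions never touched by any record simply don't exist as keys,
# so memory usage scales with covered positions, not chromosome size
my %coverage;

while (<$bedgraph_fh>) {
    chomp;
    my ($chrom2, $start, $end, $FPKM) = split /\t/;
    next unless $chrom1 eq $chrom2;
    $coverage{$_} += $FPKM for $start .. $end - 1;
}

When writing the output, iterate over sort { $a <=> $b } keys %coverage instead of array indices, and treat missing keys as zero.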

Other than that, I don't see any obvious cases of this code holding on to data that the algorithm doesn't need, although, depending on your actual objective, there may be an alternative algorithm that doesn't require as much data to be held in memory.

If you really do need to deal with a set of 90 million active data elements, though, then your primary options are going to be either buying a lot of RAM or using some form of database. In the latter case, I'd opt for SQLite (via DBD::SQLite) because it's simple and lightweight, but YMMV.
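
A minimal sketch of the SQLite route (illustrative only; the database file, table, and column names are my own invention):

use DBI;

# one row per covered position; contributions are summed at query time
my $dbh = DBI->connect('dbi:SQLite:dbname=coverage.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE coverage (chrom TEXT, pos INTEGER, fpkm REAL)');

my $ins = $dbh->prepare('INSERT INTO coverage VALUES (?, ?, ?)');

# inside the .bedgraph loop:
$ins->execute($chrom2, $_, $FPKM) for $start .. $end - 1;

$dbh->commit;    # commit once at the end; per-row commits would be very slow

# later, read back the summed coverage per position:
my $rows = $dbh->selectall_arrayref(
    'SELECT chrom, pos, SUM(fpkm) FROM coverage GROUP BY chrom, pos ORDER BY pos'
);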

Upvotes: 0
