Reputation: 23

Opening and reading a 2GB csv

I have been having problems opening and reading the contents of a 2GB CSV file. Every time I run the script it exhausts the server's memory (10GB VPS cloud server) and then gets killed. I have made a test script and was wondering if anyone could have a look and confirm that I am not doing anything silly (PHP-wise) here that would cause what seems an unusually high amount of memory usage. I have spoken to my hosting company, but they seem to be of the opinion that it is a code problem, so I'm just wondering if anyone can look over this and confirm there is nothing in the code that would cause this kind of problem.

Also, if you deal with 2GB CSVs, have you encountered anything like this before?

Thanks

Tim

<?php

ini_set("memory_limit", "10240M");

$start = time();
echo date("Y-m-d H:i:s", $start)."\n";

$file = 'myfile.csv';

$lines = $keys = array();
$line_count = 0;
$csv = fopen($file, "r");

if(!empty($csv))
{
    echo "file open \n";

    while(($csv_line = fgetcsv($csv, null, ',', '"')) !== false)
    {
        if($line_count==0) {
            foreach($csv_line as $item) {
                $keys[] = preg_replace("/[^a-zA-Z0-9]/", "", $item);    
            }
        } else {
            $array = array();
            for ($i = 0; $i <count($csv_line); $i++) {
                $array[$keys[$i]] =  $csv_line[$i]; 
            }
            $lines[] = (object) $array;

            //print_r($array);
            //echo "<br/><br/>";
        }
        $line_count++;
    }

    if ($line_count == 0) {
        echo "invalid csv or wrong delimiter / enclosure ".$file;
    }

} else {
    echo "cannot open ".$file;
}
fclose ($csv);

echo $line_count . " rows \n";

$end = time();
echo date("Y-m-d H:i:s", $end)."\n";

$time = number_format((($end - $start)/60), 2);

echo $time."\n";

echo "peak memory usages ".memory_get_peak_usage(true)."\n";

Upvotes: 1

Views: 1163

Answers (3)

hakre

Reputation: 198237

As others have already pointed out, you're loading the whole 2 GB file into memory. On top of that you create an array of strings out of each line, so the memory actually needed ends up being more than the plain file size.

You might want to process each row of the CSV file separately, ideally with an iterator, for example one that returns each line as a keyed array:

$csv = new CSVFile('../data/test.csv');

foreach ($csv as $line) {
    var_dump($line);
}

Example output:

array(3) {
  ["Make"]=> string(5) "Chevy"
  ["Model"]=> string(4) "1500"
  ["Note"]=> string(6) "loaded"
}
array(3) {
  ["Make"]=> string(5) "Chevy"
  ["Model"]=> string(4) "2500"
  ["Note"]=> string(0) ""
}
array(3) {
  ["Make"]=> string(5) "Chevy"
  ["Model"]=> string(0) ""
  ["Note"]=> string(6) "loaded"
}

This iterator is inspired by one that's built into PHP, SplFileObject. As this is an iterator, you decide what to do with each line's/row's data. See the related question: Process CSV Into Array With Column Headings For Key

class CSVFile extends SplFileObject
{
    private $keys;

    public function __construct($file)
    {
        parent::__construct($file);
        // let SplFileObject parse each line with fgetcsv() semantics
        $this->setFlags(SplFileObject::READ_CSV);
    }

    public function rewind()
    {
        parent::rewind();
        // treat the first line as the header row and skip past it
        $this->keys = parent::current();
        parent::next();
    }

    public function current()
    {
        // combine the header with the current row into a keyed array
        return array_combine($this->keys, parent::current());
    }

    public function getKeys()
    {
        return $this->keys;
    }
}

Upvotes: 2

Sérgio Carvalho

Reputation: 1143

PHP is really the wrong language for this. String manipulation usually results in copies of strings being allocated in memory, and garbage collection only happens when the script ends, by which point it is no longer needed. If you know how to do it, and it fits the execution environment, you'd be better off with Perl or sed/awk.

Having said that, there are two memory hogs in the script. The first is the foreach, which copies the array: do a foreach over the array_keys instead, and refer back to the entry in the array to get at each line. The second is the one pointed out by @YourCommonSense: you should design your algorithm so it works in streaming mode (i.e. not requiring the full dataset to be stored in memory). At a cursory glance, that seems feasible.
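
A minimal sketch of that foreach-over-array_keys idea, applied to the header loop from the question's script (how much memory this actually saves depends on PHP's copy-on-write behaviour, so treat it as an illustration rather than a guaranteed fix):

// instead of: foreach ($csv_line as $item) { ... }
foreach (array_keys($csv_line) as $i) {
    // refer back to the entry in the original array rather than iterating over a copy
    $keys[] = preg_replace("/[^a-zA-Z0-9]/", "", $csv_line[$i]);
}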

Upvotes: 0

Your Common Sense

Reputation: 158003

It is not actually an "opening" problem but rather a processing problem.

I am sure you don't need to keep all the parsed lines in memory like you currently do.

Why not just put each parsed line wherever it belongs - into a database, another file, or whatever?

That way your code keeps as little as a single line in memory at a time.
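
A minimal sketch of that approach, assuming a hypothetical my_table table and PDO connection details (none of which are in the original answer); each row is inserted as soon as it is parsed, so only the current line is held in memory:

// hypothetical connection details and table name - adjust to your environment
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$csv  = fopen('myfile.csv', 'r');
$keys = fgetcsv($csv, 0, ',', '"');   // header row
$cols = implode(', ', array_map(function ($k) {
    return preg_replace("/[^a-zA-Z0-9]/", "", $k);   // same sanitising as in the question
}, $keys));
$placeholders = implode(', ', array_fill(0, count($keys), '?'));
$stmt = $pdo->prepare("INSERT INTO my_table ($cols) VALUES ($placeholders)");

while (($row = fgetcsv($csv, 0, ',', '"')) !== false) {
    $stmt->execute($row);   // only the current row is in memory at any point
}
fclose($csv);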

Upvotes: 5
