I have a script which parses a CSV file and starts verifying the emails. This works fine for 1,000 lines, but on 15 million lines it shows a memory-exhausted error. The file size is 400 MB. Any suggestions on how to parse and verify them?
Server specs: Core i7 with 32 GB of RAM
function parse_csv($file_name, $delimeter=',') {
    $header = false;
    $row_count = 0;
    $data = [];

    // clear any previous results
    reset_parse_csv();

    // parse
    $file = fopen($file_name, 'r');

    while (!feof($file)) {
        $row = fgetcsv($file, 0, $delimeter);

        if ($row == [NULL] || $row === FALSE) { continue; }

        if (!$header) {
            $header = $row;
        } else {
            $data[] = array_combine($header, $row);
            $row_count++;
        }
    }
    fclose($file);

    return ['data' => $data, 'row_count' => $row_count];
}

function reset_parse_csv() {
    $header = false;
    $row_count = 0;
    $data = [];
}
Upvotes: 5
Views: 7426
Reputation: 13559
Pushing a large dataset (file lines, etc.) into an array increases memory usage in direct proportion to the number of items handled. So in this case, the bigger the file, the bigger the memory usage.
Instead of collecting the data into an array, it is better to process it in place:
$file = fopen($file_name, 'r');

while (!feof($file)) {
    $row = fgetcsv($file, 0, $delimeter);

    if ($row == [NULL] || $row === FALSE) { continue; }

    if (!$header) {
        $header = $row;
    } else {
        // do whatever is intended with the row,
        // instead of $data[] = array_combine($header, $row);
        do_something($row);
    }
}
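Since the question is about verifying e-mail addresses, a possible do_something() could be sketched as below. The function name comes from the snippet above; the assumption that the address sits in the first column is mine, so adapt it to your layout:

function do_something(array $row)
{
    // Hypothetical: assume the e-mail address is in the first column of the row.
    $email = trim($row[0] ?? '');

    // filter_var() only checks the syntax; deliverability checks (MX lookup,
    // SMTP probe, ...) would be a separate and much slower step.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        // count, log or skip invalid addresses here
        error_log("Invalid e-mail: {$email}");
    }
}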
The major difference here is:
you do not load and consume all of the data at once. You get items on demand (like a stream) and process them one at a time, which has a huge impact on memory usage.
However, if the actual workload cannot easily be moved inside the loop (for example, a function that expects a whole array to process), you can move the processing elsewhere while keeping the memory footprint low. For this, you can use a generator.
Reading the PHP documentation, it fits your case very well (emphasis mine):
A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.
Something like this:
function csv_read($filename, $delimeter = ',')
{
    $header = [];
    $row = 0;

    # tip: don't open the file on every csv_read() call; pass the handle as a parameter instead ;)
    $handle = fopen($filename, "r");

    if ($handle === false) {
        # note: inside a generator, "return false" never reaches the caller
        # (the caller always receives a Generator), so signal the failure explicitly
        throw new RuntimeException("Unable to open {$filename}");
    }

    while (($data = fgetcsv($handle, 0, $delimeter)) !== false) {

        if (0 == $row) {
            $header = $data;
        } else {
            # on-demand usage
            yield array_combine($header, $data);
        }

        $row++;
    }

    fclose($handle);
}
And then:
$generator = csv_read('rdu-weather-history.csv', ';');
do_stuff_with_array($generator);
Here, the $generator variable does not hold all the data at once. On the one hand, it behaves much like an array and can be used inside a foreach() statement. On the other hand, it acts like a stream and reads one item at a time.
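do_stuff_with_array() is not defined in the answer; a minimal sketch of such a consumer, assuming it only needs to iterate the rows (replace the loop body with your real processing), could be:

function do_stuff_with_array(iterable $rows)
{
    $count = 0;

    // Each $row is produced on demand by the csv_read() generator;
    // nothing is buffered in memory.
    foreach ($rows as $row) {
        // $row is an associative array (header => value);
        // do the real work here, e.g. validate an e-mail column.
        $count++;
    }

    echo "Processed {$count} rows\n";
}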
P.S.: The CSV file above was taken from: https://data.townofcary.org/api/v2/catalog/datasets/rdu-weather-history/exports/csv
Upvotes: 14
Reputation: 7703
It is not necessary to write a generator function. The SplFileObject also works fine.
$fileObj = new SplFileObject($file);
$fileObj->setFlags(
    SplFileObject::READ_CSV
    | SplFileObject::SKIP_EMPTY
    | SplFileObject::READ_AHEAD
    | SplFileObject::DROP_NEW_LINE
);
$fileObj->setCsvControl(';');

foreach ($fileObj as $row) {
    // do something
}
I tried that with the file "rdu-weather-history.csv" (> 500 KB). memory_get_peak_usage() returned a value of 424k after the foreach loop. The values must be processed line by line; if a two-dimensional array is created instead, the memory required for this example grows to more than 8 MB.
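If you also want the header-keyed rows that the generator above yields, the same line-by-line approach works with SplFileObject. A sketch, assuming the same semicolon-delimited weather file:

$fileObj = new SplFileObject('rdu-weather-history.csv');
$fileObj->setFlags(SplFileObject::READ_CSV | SplFileObject::SKIP_EMPTY | SplFileObject::READ_AHEAD);
$fileObj->setCsvControl(';');

$header = null;

foreach ($fileObj as $row) {
    // Depending on the PHP version, blank lines may still show up as [null].
    if (!is_array($row) || $row === [null]) {
        continue;
    }
    if ($header === null) {
        $header = $row;    // first line holds the column names
        continue;
    }
    $assoc = array_combine($header, $row);
    // process $assoc here, one row at a time
}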
Upvotes: 2
Reputation: 4474
One thing you could possibly attempt is a bulk import into MySQL, which may give you a better platform to work from once the data is imported.
LOAD DATA INFILE '/home/user/data.csv' INTO TABLE CSVImport; where the CSVImport columns match your CSV.
It's a bit of a left-field suggestion, but depending on your use case it can be a better way to parse massive datasets.
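A minimal sketch of running that import from PHP via PDO; the DSN, credentials, table definition and the comma delimiter are assumptions, and LOAD DATA LOCAL INFILE must be permitted by the server:

// Hypothetical connection details; adjust DSN, credentials and table layout.
$pdo = new PDO(
    'mysql:host=localhost;dbname=test;charset=utf8mb4',
    'user',
    'password',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]  // needed for LOAD DATA LOCAL INFILE
);

$sql = "LOAD DATA LOCAL INFILE '/home/user/data.csv'
        INTO TABLE CSVImport
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES";  // skip the CSV header row

$rowsImported = $pdo->exec($sql);
echo "Imported {$rowsImported} rows\n";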
Upvotes: 1