Reputation: 50
I have a .csv file that is about 5 MB (~45,000 rows). What I need to do is run through each row of the file and check whether the ID on each line is already in a table in my database. If it is, I can delete that row from the file.
I did a good amount of research on the most memory-efficient way to do this, so I've been using a method of writing lines that don't need to be deleted to a temporary file and then renaming that file as the original. Code below:
$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');
while (($row = fgetcsv($file)) !== false) {
    // id is the 7th value in the row; cast to int so the query stays safe
    $id = (int) $row[6];
    // check table to see if id exists
    $sql = "SELECT id FROM `table` WHERE id = $id";
    $result = mysqli_query($conn, $sql);
    // if id is in the database, skip to next row
    if (mysqli_num_rows($result) > 0) {
        continue;
    }
    // else write line to temp file
    fputcsv($temp, $row);
}
fclose($file);
fclose($temp);
// overwrite original file
rename($tempFilename, $filename);
Problem is, I'm running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?
Upvotes: 1
Views: 539
Reputation: 896
Two steps:
1. Use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table. Note that you need FILE privileges on the database to use it.
2. Filter the imported rows against your existing table with a single query.
The other option is to use your loop to insert your csv file into a separate table, and then proceed with step 2.
Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do some bulk data manipulation with big queries; it's blazingly fast, and I would recommend this route for files containing > 100k lines.
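A minimal sketch of how both steps could look from PHP with mysqli, assuming a staging table named csv_import whose id column matches the main table's; the staging table, file path, and CSV format clauses are placeholders to adapt, not part of the question's setup:
// step 1: bulk-load the csv into the staging table
// (the MySQL server reads the file itself, so the path must be visible to it
// and permitted by secure_file_priv)
$sql = "LOAD DATA INFILE '/path/to/file.csv'
        INTO TABLE csv_import
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'";
mysqli_query($conn, $sql);

// step 2: keep only the rows whose id is not already in the main table
$sql = "SELECT c.* FROM csv_import c
        LEFT JOIN `table` t ON t.id = c.id
        WHERE t.id IS NULL";
$result = mysqli_query($conn, $sql);
while ($row = mysqli_fetch_row($result)) {
    fputcsv($temp, $row); // write the surviving rows back out, as in the question
}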
Upvotes: 3
Reputation: 2488
You fire one database query per line, i.e. 45,000 queries... that takes too much time.
Better to run a single query before the loop and read the existing ids into a lookup array, then only check that array in the loop.
Pseudo code:
$lookup = [];
$result = mysqli_query($conn, 'SELECT id FROM `table`');
while ($row = mysqli_fetch_assoc($result)) {
    // key the array by id so the check below is a cheap hash lookup
    $lookup[$row['id']] = $row['id'];
}
// now read CSV
while (($row = fgetcsv($h)) !== false) {
    $id = $row[6];
    if (isset($lookup[$id])) {
        // id exists in the database, skip this row
        continue;
    }
    // write the row with the non-existing id to the other file
    fputcsv($temp, $row);
}
Edit: Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?
Collect the ids from the CSV into an array. Run a single query to find all of those ids in the database and collect the matches (there can be at most as many as there are in the CSV). Now array_diff()
the ids from the file against the ids from the database; the ids remaining exist in the CSV but not in the database.
Pseudo code:
$ids_csv = [];
while (($row = fgetcsv($h)) !== false) {
    // cast to int so the ids can be safely inlined into the query
    $ids_csv[] = intval($row[6]);
}
$sql = sprintf('SELECT id FROM `table` WHERE id IN (%s)', implode(',', $ids_csv));
$ids_db = [];
$result = mysqli_query($conn, $sql);
while ($row = mysqli_fetch_assoc($result)) {
    $ids_db[] = (int) $row['id'];
}
$missing_in_db = array_diff($ids_csv, $ids_db);
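If the IN() list ever gets too long for a single statement (for example, hitting the server's max_allowed_packet limit), one possible refinement, not part of the original answer, is to query in chunks with array_chunk() and merge the results. A sketch, with an arbitrary chunk size of 10,000:
$ids_db = [];
foreach (array_chunk($ids_csv, 10000) as $chunk) {
    // one IN() query per chunk instead of one huge statement
    $sql = sprintf('SELECT id FROM `table` WHERE id IN (%s)', implode(',', $chunk));
    $result = mysqli_query($conn, $sql);
    while ($row = mysqli_fetch_assoc($result)) {
        $ids_db[] = (int) $row['id'];
    }
}
$missing_in_db = array_diff($ids_csv, $ids_db);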
Upvotes: 4