Reputation: 50
I have a .csv file that is about 5 MB (~45,000 rows). What I need to do is run through each row of the file and check whether the ID on each line is already in a table in my database. If it is, I can delete that row from the file.
I did a good amount of research on the most memory-efficient way to do this, so I've been using a method of writing lines that don't need to be deleted to a temporary file and then renaming that file as the original. Code below:
$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');
while (($row = fgetcsv($file)) !== false) {
    // id is the 7th value in the row; cast to int so the query stays safe
    $id = (int) $row[6];
    // check table to see if id exists
    $sql = "SELECT id FROM `table` WHERE id = $id";
    $result = mysqli_query($conn, $sql);
    // if id is in the database, skip to next row
    if (mysqli_num_rows($result) > 0) {
        continue;
    }
    // else write line to temp file
    fputcsv($temp, $row);
}
fclose($file);
fclose($temp);
// overwrite original file
rename($tempFilename, $filename);
Problem is, I'm running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?
Upvotes: 1
Views: 539
Reputation: 896
Two steps:
1. Use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table. Note that you need FILE privileges on the database to use it.
2. Filter the imported rows against your existing table with a single query.
The other option is to use your loop to insert your csv file into a separate table, and then proceed with step 2.
Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do some bulk data manipulation with big queries; it's blazingly fast, and I would recommend this route for files containing > 100k lines.
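A minimal sketch of how both steps could look from PHP with mysqli, assuming a staging table named csv_import whose id column matches the main table's; the staging table, file path, and CSV format clauses are placeholders to adapt, not part of the question's setup:
// step 1: bulk-load the csv into the staging table
// (the MySQL server reads the file itself, so the path must be visible to it
// and permitted by secure_file_priv)
$sql = "LOAD DATA INFILE '/path/to/file.csv'
        INTO TABLE csv_import
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'";
mysqli_query($conn, $sql);

// step 2: keep only the rows whose id is not already in the main table
$sql = "SELECT c.* FROM csv_import c
        LEFT JOIN `table` t ON t.id = c.id
        WHERE t.id IS NULL";
$result = mysqli_query($conn, $sql);
while ($row = mysqli_fetch_row($result)) {
    fputcsv($temp, $row); // write the surviving rows back out, as in the question
}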
Upvotes: 3
Reputation: 2488
You fire one database query per line, i.e. 45,000 queries... that takes too much time.
Better to run a single query before the loop and read the existing ids into a lookup array, then only check that array in the loop.
Pseudo code:
$lookup = [];
$result = mysqli_query($conn, 'SELECT id FROM `table`');
while ($row = mysqli_fetch_assoc($result)) {
    // key the array by id so the check below is a cheap hash lookup
    $lookup[$row['id']] = $row['id'];
}
// now read CSV
while (($row = fgetcsv($h)) !== false) {
    $id = $row[6];
    if (isset($lookup[$id])) {
        // id exists in the database, skip this row
        continue;
    }
    // write the row with the non-existing id to the other file
    fputcsv($temp, $row);
}
Edit: Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?
Collect the ids from the CSV into an array. Run a single query to find all of those ids in the database and collect the matches (there can be at most as many as there are in the CSV). Now array_diff()
the ids from the file against the ids from the database; the ids remaining exist in the CSV but not in the database.
Pseudo code:
$ids_csv = [];
while (($row = fgetcsv($h)) !== false) {
    // cast to int so the ids can be safely inlined into the query
    $ids_csv[] = intval($row[6]);
}
$sql = sprintf('SELECT id FROM `table` WHERE id IN (%s)', implode(',', $ids_csv));
$ids_db = [];
$result = mysqli_query($conn, $sql);
while ($row = mysqli_fetch_assoc($result)) {
    $ids_db[] = (int) $row['id'];
}
$missing_in_db = array_diff($ids_csv, $ids_db);
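If the IN() list ever gets too long for a single statement (for example, hitting the server's max_allowed_packet limit), one possible refinement, not part of the original answer, is to query in chunks with array_chunk() and merge the results. A sketch, with an arbitrary chunk size of 10,000:
$ids_db = [];
foreach (array_chunk($ids_csv, 10000) as $chunk) {
    // one IN() query per chunk instead of one huge statement
    $sql = sprintf('SELECT id FROM `table` WHERE id IN (%s)', implode(',', $chunk));
    $result = mysqli_query($conn, $sql);
    while ($row = mysqli_fetch_assoc($result)) {
        $ids_db[] = (int) $row['id'];
    }
}
$missing_in_db = array_diff($ids_csv, $ids_db);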
Upvotes: 4