SergkeiM

Reputation: 4168

updateOrCreate() >100k records

I have a feature where a user can upload a CSV file (a list) to store/update emails in the DB:

//load the CSV into an array of lines
$records = file($request->file('list'), FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

//prepare one row per email for the upsert loop
$updateOrCreate = [];

foreach ($records as $email) {
    $updateOrCreate[] = [
        'email' => $email
    ];
}

DB::beginTransaction();

try {

    $toKeep = [];

    //check if the email already exists; if not, create it
    foreach ($updateOrCreate as $record) {

        $email = List::updateOrCreate([
            'email' => $record['email']
        ], $record);

        $toKeep[] = $email->id;

    }

    //delete all records that were not in the new CSV
    if (count($toKeep)) {
        List::whereNotIn("id", $toKeep)->delete();
    }

} catch (Exception $e) {

    DB::rollBack();

    return response()->json([
        'message' => $e->getMessage()
    ], 422);
}

DB::commit();

This solution works fine up to a few thousand records, but if I try to upload 50k or more it gets slow. I understand why it's slow (it runs a select plus an update or insert for every single record):

foreach ($updateOrCreate as $record) {

    $email = List::updateOrCreate([
       'email' => $record['email']
    ], $record);

    $toKeep[] = $email->id;     
}

For inserts I've used a raw pg_copy_from, which works great and fast:

$result = pg_copy_from($dbconn, 'lists (email)', $records, ',');

So I was wondering if there is something similar, a more efficient/faster way (maybe like pg_copy_from) to add the records that are new and delete the records that are no longer in the list. Roughly the kind of thing sketched below is what I have in mind.
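To make it concrete, here's what I'm imagining, though I haven't tested it: bulk-load the CSV into a temporary staging table with pg_copy_from, then reconcile with two set-based queries. The staging table name is made up, and the upsert assumes a UNIQUE index on lists.email:

//create a throwaway staging table for this upload
pg_query($dbconn, 'CREATE TEMP TABLE lists_staging (email text)');

//bulk-load the CSV rows, same as the working insert above
pg_copy_from($dbconn, 'lists_staging (email)', $records, ',');

//insert emails that aren't in lists yet (needs a UNIQUE index on lists.email)
pg_query($dbconn, "
    INSERT INTO lists (email)
    SELECT email FROM lists_staging
    ON CONFLICT (email) DO NOTHING
");

//delete emails that were not in the uploaded CSV
pg_query($dbconn, "
    DELETE FROM lists
    WHERE email NOT IN (SELECT email FROM lists_staging)
");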

Upvotes: 0

Views: 373

Answers (1)

Jason

Reputation: 3030

From what I can see, you're doing this as an AJAX upload of the CSV, and you're probably hitting timeouts during processing. The simple fact is that with this volume of records and the actions you are taking, this is going to take a long time.

I recommend rethinking your approach to something like:

  • User uploads file
  • File is stored somewhere in storage
  • A new queued job is set up to process the CSV
  • User response is sent
  • Queued job executes on the backend (no timeouts, but watch for out-of-memory issues)
  • User receives a notification (broadcast, email) about the process when done

As long as you manage the user's expectations, this should work. A rough sketch of the wiring is below.
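This is a minimal sketch assuming a standard Laravel queue setup; the ProcessEmailCsv job and CsvImportFinished notification names are made up:

//in the controller: store the upload, queue the job, respond immediately
$path = $request->file('list')->store('csv-imports');
ProcessEmailCsv::dispatch($path, $request->user());

return response()->json(['message' => 'Import queued'], 202);

//app/Jobs/ProcessEmailCsv.php
namespace App\Jobs;

use App\Models\User;
use App\Notifications\CsvImportFinished; //hypothetical notification
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class ProcessEmailCsv implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(
        public string $path,
        public User $user
    ) {}

    public function handle(): void
    {
        $records = file(
            Storage::path($this->path),
            FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES
        );

        //... run the existing updateOrCreate / delete logic here,
        //ideally in chunks to keep memory bounded ...

        //let the user know the import finished
        $this->user->notify(new CsvImportFinished());
    }
}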

Upvotes: 3
