xxx_coder_noscope

Reputation: 85

Node.js: best tool for queuing database inserts?

I have an app with an endpoint that receives a large .csv file, which needs to be parsed and validated, and each row in the file then needs to be inserted into a database. Because the file is too large to process before responding to the user, I just send back a message saying "the file is queued for processing". What is the best tool in my case for inserting 10 to 100k rows into the database (PostgreSQL) in the background, with some way to get a report on whether it succeeded or not? What would the whole life cycle of uploading such a file look like?

Upvotes: 0

Views: 1075

Answers (1)

turbopasi

Reputation: 3605

Independent of which tool(s) you end up using for this job, I would suggest a lifecycle like the following.

  1. The user uploads the file to the server. Once the file is stored you can safely end the request and tell the user the file has been uploaded successfully and is now being processed. It probably depends on the database, but IMHO I wouldn't store a file buffer directly in the database. I would simply store the file on file storage and only save a reference to it in the database. I do this with all files in general.

  2. On the server there is some sort of processing queue (I say "sort of" because there are many possibilities out there). Off the top of my head I would suggest 3 processing queues. For now, let's not think about the "progress update / report" part.

    - Queue A : File parsing, validating 
    - Queue B : File reading
    - Queue C : Database storing
  1. The file has been uploaded; a job is created and pushed onto the first queue, Queue A (file parsing). Queue A validates the file and, if the task succeeds, creates the next job and pushes it onto Queue B.
  2. Queue B reads the file row by row and pushes each row to the next queue, Queue C.
  3. Finally, Queue C stores each row it is fed into the database.
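The pipeline above can be sketched in-memory with plain Node.js. This is only an illustration of the hand-off between the three queues: `SimpleQueue` is a made-up helper, `db` is a stand-in array, and in production each stage would be a persistent job queue (e.g. BullMQ or RabbitMQ) with real PostgreSQL inserts.

```javascript
// Minimal in-memory sketch of the three-queue pipeline.
// In production, back each queue with a persistent job queue
// so jobs survive server restarts.

// A tiny FIFO queue that runs a handler on each job pushed to it.
class SimpleQueue {
  constructor(handler) {
    this.handler = handler;
    this.jobs = [];
    this.running = false;
  }
  push(job) {
    this.jobs.push(job);
    if (!this.running) this.drain();
  }
  async drain() {
    this.running = true;
    while (this.jobs.length) {
      await this.handler(this.jobs.shift());
    }
    this.running = false;
  }
}

const db = []; // stand-in for PostgreSQL

// Queue C: store one row in the database.
const queueC = new SimpleQueue(async (row) => {
  db.push(row); // real code: INSERT via pg / knex
});

// Queue B: read the (already validated) file row by row.
const queueB = new SimpleQueue(async (fileContent) => {
  const [header, ...rows] = fileContent.trim().split('\n');
  for (const line of rows) queueC.push(line.split(','));
});

// Queue A: parse/validate the file, then hand it to Queue B.
const queueA = new SimpleQueue(async (fileContent) => {
  if (!fileContent.includes(',')) throw new Error('not a CSV');
  queueB.push(fileContent);
});

queueA.push('id,name\n1,alice\n2,bob');
```

Each queue only knows about the next one, which is what makes it possible to scale or monitor the stages independently.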

Advantages

  • A queue doesn't have to wait until all the other queues are done with their work. Each queue is only responsible for its part of the work line.
  • For example, when Queue A is done validating one file, it can immediately start on the next one. When Queue B is done reading one file, it can immediately start reading the next, whether or not Queue C has finished writing all lines to the database.
  • You can decide (per queue) how many jobs are processed at once: file by file, or multiple files at once (multithreading, asynchronous operations).
  • By splitting the work into multiple "work stations" you gain more insight into what's going on, and where.
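The per-queue concurrency point can be sketched with a small worker-pool helper. `processAll` and the file names are made up for illustration; queue libraries like BullMQ expose the same idea as a worker `concurrency` option.

```javascript
// Sketch: let one queue process up to `concurrency` jobs at once.
async function processAll(jobs, handler, concurrency) {
  const queue = [...jobs];
  const results = [];
  // Start `concurrency` workers that pull from the shared queue.
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length) {
      const job = queue.shift();
      results.push(await handler(job));
    }
  });
  await Promise.all(workers);
  return results;
}

// e.g. validate up to 3 files at the same time
processAll(
  ['a.csv', 'b.csv', 'c.csv', 'd.csv'],
  async (file) => `validated ${file}`,
  3
).then(console.log);
```

Tuning this number per queue lets you keep the database stage conservative (few concurrent inserts) while parsing many files in parallel.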

Progress Updates / Report

Let's assume this is all controlled by the user via a frontend web app.

  • Add real-time updates via a websocket, showing what's happening with the file at this moment ("Validating file ...", "Reading file ...", "Saving to database ...", "45 / 13320 lines done ...") and so on.
  • Let the server save reports on how many tasks succeeded, how many failed, processing time, and other statistics, and offer those reports per job via another endpoint.

Without going too much into detail, I hope this gives you a good starting point!

Upvotes: 1
