Sam Leurs

Reputation: 2020

Insert 1 million rows into PostgreSQL

I'm creating an app in React Native with a Node.js backend. The app is almost done, and I'm now stress testing the backend.

In my PostgreSQL database, I have a table called notifications to store all the notifications a user receives.

In my app, a user can follow pages. When a page posts a new message, I want to send a notification to all users following that page. Every user should receive an individual notification.

Let's say a page is followed by 1 million users, and the page posts a new message: this means 1 million notifications (e.g. 1 million rows) should be inserted into the database.

My solution (for now) is to chunk up the array of user IDs (of the users following the page) into chunks of 1000 user IDs each, and run an insert query for every chunk.

const db = require('./db');
const format = require('pg-format');

const userIds = [1, 2, 3, 4, 5, ..., 1000000];

// split up the user IDs array into chunks of 1000 user IDs each
// (chunkArray is a function that splits up an array into multiple
// arrays with x items each, in this case x = 1000)
const chunks = chunkArray(userIds, 1000);

// loop over each chunk (this runs inside an async function, so the
// chunks are inserted one after the other)
for (const chunk of chunks) {
  const array = [];

  // create an array containing 1000 rows, each containing a user ID,
  // notification type and page ID (for inserting into the database)
  chunk.forEach(userId => {
    array.push([userId, 'post', _PAGE_ID_]);
  });

  // create and run the query (pg-format expands the nested arrays
  // into a multi-row VALUES list)
  const query = format("INSERT INTO notifications (userId, type, pageId) VALUES %L", array);
  const result = await db.query(query);
}

I'm using node-postgres for the database connection, and I'm creating a connection pool. I fetch one client from the pool, so only one connection is used for all the queries in the loop.
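For reference, a minimal sketch of that single-client pattern, assuming the standard node-postgres Pool API (the function and variable names below are placeholders, not the original code):

const { Pool } = require('pg');
const format = require('pg-format');

const pool = new Pool(); // connection settings come from the PG* environment variables

async function insertNotificationChunks(chunks, pageId) {
  // check out one client so every chunked INSERT reuses the same connection
  const client = await pool.connect();
  try {
    for (const chunk of chunks) {
      const rows = chunk.map(userId => [userId, 'post', pageId]);
      await client.query(
        format('INSERT INTO notifications (userId, type, pageId) VALUES %L', rows)
      );
    }
  } finally {
    client.release(); // always hand the client back to the pool
  }
}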

This all works, but inserting 1 million rows takes a few minutes. I'm not sure this is the right way to do this.

Another solution I came up with is using "general notifications". When a page updates a post, I only insert 1 notification into the notifications table, and when I query for all notifications for a specific user, I check which pages the user is following and fetch all the general notifications of those pages with the query. Would this be a better solution? It would leave me with far fewer notification rows, and I think it would increase performance.


I'm trying to implement my other solution. When a page updates a post, I insert only one notification without a user ID (because it has no specific destination), but with the page ID.

When I fetch all the notifications for a user, I first check for all the notifications with that user ID, and then for all notifications without a user ID but with a page ID of a page that the user is following.

I think this is not the easiest solution, but it will reduce the number of rows, and if I do a good job with the indexes, I think I'll be able to write a pretty performant query.
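A minimal sketch of such a query with node-postgres, assuming a followers table with userId and pageId columns (the table and column names are assumptions, and userId holds the current user's ID):

// fetch the user's personal notifications plus the "general" (userId IS NULL)
// notifications of every page the user follows
const { rows } = await db.query(
  `SELECT n.*
     FROM notifications n
    WHERE n.userId = $1
   UNION ALL
   SELECT n.*
     FROM notifications n
     JOIN followers f ON f.pageId = n.pageId
    WHERE f.userId = $1
      AND n.userId IS NULL`,
  [userId]
);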

Upvotes: 0

Views: 2083

Answers (3)

arun-r

Reputation: 3381

You can use the save method with the chunk option:

await connection.getRepository('sample').save([...itemList], { chunk: 500 });
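For context, a minimal sketch of how this could look with the question's data, assuming TypeORM (the 'Notification' entity name and the _PAGE_ID_ placeholder are assumptions):

const { getConnection } = require('typeorm');

// build one plain object per follower and let the ORM split the insert
// into batches of 500 rows per statement
const items = userIds.map(userId => ({ userId, type: 'post', pageId: _PAGE_ID_ }));

await getConnection()
  .getRepository('Notification')
  .save(items, { chunk: 500 });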

Upvotes: -1

Daphoque

Reputation: 4678

Use async.eachOfLimit to insert X chunks in parallel. In the following example you will insert 10 chunks in parallel:

const async = require('async');
const format = require('pg-format');
const db = require('./db');

const userIds = [1, 2, 3, 4, 5, ..., 1000000];
const chunks = chunkArray(userIds, 1000);

var BATCH_SIZE_X = 10;

async.eachOfLimit(chunks, BATCH_SIZE_X, function(c, i, ecb){

    // turn each user ID into a row for the multi-row INSERT
    c = c.map(function(e){ return [e, 'post', _PAGE_ID_]; });

    const query = format("INSERT INTO notifications (userId, type, pageId) VALUES %L", c);

    db.query(query)
        .then(function(){ return ecb(); })
        .catch(ecb);

}, function(err){

   if(err){
       console.error('insert failed', err);
   }
   else{
       console.log('all chunks inserted');
   }
});

Upvotes: -1

Jimmy Stenke

Reputation: 11220

Without getting into which solution would be better, one way to solve it could be this, provided that you keep all the pages and followers in the same database.

INSERT INTO notifications (userId, type, pageId)
SELECT users.id, 'post', pages.id
  FROM pages
  JOIN followers ON followers.pageId = pages.id
  JOIN users ON followers.userId = users.id
 WHERE pages.id = _PAGE_ID_

This would allow the DB to handle everything, which should speed up the insert since you don't need to send each individual row from the application server.
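If it helps, a minimal sketch of running that statement from node-postgres, assuming the db module from the question and a pageId variable holding the page's ID:

// one round trip: the database builds and inserts all follower rows itself
await db.query(
  `INSERT INTO notifications (userId, type, pageId)
   SELECT users.id, 'post', pages.id
     FROM pages
     JOIN followers ON followers.pageId = pages.id
     JOIN users ON followers.userId = users.id
    WHERE pages.id = $1`,
  [pageId]
);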

If you don't have the users/pages in the same DB then it's a bit trickier. You could prepare a CSV file, upload it to the database server and use the COPY command. If you don't have access to the server, you might be able to stream the data directly, as the COPY command can read from stdin (that depends on the library; I'm not familiar with node-postgres, so I can't tell if it's possible).
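For what it's worth, streaming COPY ... FROM STDIN does appear to be possible with node-postgres through the separate pg-copy-streams package; a minimal sketch, assuming that package and the notifications table from the question:

const { Pool } = require('pg');
const { from: copyFrom } = require('pg-copy-streams');
const { Readable } = require('stream');
const { pipeline } = require('stream/promises');

const pool = new Pool();

async function copyNotifications(userIds, pageId) {
  const client = await pool.connect();
  try {
    // COPY reads CSV rows from stdin, so stream one "userId,type,pageId" line per follower
    const ingest = client.query(copyFrom(
      'COPY notifications (userId, type, pageId) FROM STDIN WITH (FORMAT csv)'
    ));
    const rows = Readable.from(userIds.map(id => `${id},post,${pageId}\n`));
    await pipeline(rows, ingest);
  } finally {
    client.release();
  }
}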

Alternatively, you can do everything in a transaction by issuing a BEGIN before you do the inserts. This is the slowest alternative, but it saves the overhead of Postgres creating an implicit transaction for each statement. Just don't forget to commit afterwards. The library might even have helpers for creating explicit transactions and inserting data through them.
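A minimal sketch of that transaction pattern with node-postgres, assuming a pool, the pg-format helper, the chunks array from the question, and pageId standing in for the page's ID:

const client = await pool.connect();
try {
  await client.query('BEGIN'); // one explicit transaction instead of one per INSERT
  for (const chunk of chunks) {
    const rows = chunk.map(userId => [userId, 'post', pageId]);
    await client.query(
      format('INSERT INTO notifications (userId, type, pageId) VALUES %L', rows)
    );
  }
  await client.query('COMMIT');
} catch (err) {
  await client.query('ROLLBACK'); // undo everything if any chunk fails
  throw err;
} finally {
  client.release();
}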

That said, I would probably do a variation of your second solution since it would create fewer rows in the DB, but that depends on your other requirements; it might not be possible if you need to track notifications or perform other actions on them.

Upvotes: 3
