Reputation: 54003
I've got a system written in Node.js which first has to import records from very big CSV files into a database. Using Sequelize as my ORM, I created a simple model as follows:
"use strict";
const Sequelize = require('sequelize');
const sequelize = new Sequelize('mm', 'root', 'password', {
host: 'localhost',
dialect: 'mysql',
logging: true,
pool: {max: 5, min: 0, idle: 100000},
});
const Index = sequelize.define('index', {
value: {type: Sequelize.FLOAT}
});
I then wrote the following code to loop over the lines in the file, interpret the lines, and write them to the database:
let readline = require('readline');
let moment = require('moment');
let lineReader = readline.createInterface({
    input: require('fs').createReadStream('files/price_index.csv')
});
lineReader.on('line', function (line) {
    let splitted = line.split(',');
    let dt = moment(parseInt(splitted[0]));
    let value = parseFloat(splitted[1]);
    console.log(dt.format(), value);
    Index.create({value: value, createdAt: dt});
});
This works OK, but it pauses for about 3 seconds after every 3120 records. I tried both SQLite and MySQL, and it always pauses after exactly 3120 records.
Seeing that Sequelize also only starts logging the insert queries after those 3120 records, I presume the cause of this behaviour is some sort of caching mechanism which puts all queries in a queue until it either has nothing left to do or hits that magic limit of exactly 3120 queries.
I tried increasing the pool.max number in the initialisation of Sequelize, but that doesn't seem to make any difference.
Can anybody either confirm my caching idea or explain to me what the true reason for this behaviour is? Can I somehow change this behaviour so that it has a consistent throughput? All tips are welcome!
Upvotes: 2
Views: 423
Reputation: 74881
I think 3120 lines is the high water mark for the createReadStream buffer, which is 64 KiB. Node will back off reading when the buffer is full.
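For context, that 64 KiB is the default highWaterMark of fs.createReadStream, and it can be set lower when the stream is created. A minimal sketch, where the 16 KiB value is purely illustrative and not part of the original code:

let fs = require('fs');
// fs.createReadStream buffers 64 KiB per read by default. A smaller
// highWaterMark hands readline smaller chunks, so fewer 'line' events
// fire in one burst of the event loop.
let lineReader = require('readline').createInterface({
    input: fs.createReadStream('files/price_index.csv', {highWaterMark: 16 * 1024})
});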
It looks like the 3120 line events all run on the same Node event tick, so you get 3120 lines processed and 3120 async Index.create calls scheduled for the next tick. You end up with huge chunks of processing to do on each side: either reading and scheduling the queries, or dealing with the mass of scheduled queries.
When the 3120 line event functions are done, some garbage collection happens and the 3120 scheduled Sequelize create calls get a chance to do their thing. This is the "pause" in the data, but Node is still processing. All the create calls take a few seconds to complete, then there is some more garbage collection and it's back to the next chunk of CSV data and all its line events. The process goes back and forth like that.
In a CSV file with 10000 lines I see only about 3 queries able to run before all 10000 rows of CSV data have been read and scheduled for insert.
You probably want to use a readable stream with smaller chunks, then block reads based on the Sequelize inserts completing. You may need to do the line processing yourself rather than using readline. If the CSV file fits in memory, just read the whole thing in, as the scheduling will be easier.
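As a rough sketch of blocking reads on inserts completing, reusing lineReader, moment and Index from the question: the in-flight counter and the threshold of 5 (matching pool.max) are assumptions, and readline's pause() may still emit a few lines that are already buffered.

let inFlight = 0;
const MAX_IN_FLIGHT = 5; // illustrative threshold, matching pool.max

lineReader.on('line', function (line) {
    let splitted = line.split(',');
    let dt = moment(parseInt(splitted[0]));
    let value = parseFloat(splitted[1]);

    inFlight++;
    if (inFlight >= MAX_IN_FLIGHT) {
        lineReader.pause(); // stop emitting 'line' events while inserts catch up
    }

    Index.create({value: value, createdAt: dt})
        .catch(function (err) { console.error(err); })
        .then(function () {
            inFlight--;
            if (inFlight < MAX_IN_FLIGHT) {
                lineReader.resume();
            }
        });
});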
Maybe use something like queue to manage the inserts, allowing up to your Sequelize pool max as the concurrency. Then, once the length of the queue is low enough, allow reads to happen again.
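For illustration, that could look something like the following with the async library's queue; the answer only says "something like queue", so the exact module, the length thresholds and the worker shape are assumptions, with the concurrency of 5 matching the Sequelize pool max from the question.

let async = require('async');

// One task per CSV row; at most 5 inserts run at a time, matching pool.max.
let insertQueue = async.queue(function (task, done) {
    Index.create({value: task.value, createdAt: task.createdAt})
        .then(function () { done(); })
        .catch(done);
}, 5);

lineReader.on('line', function (line) {
    let splitted = line.split(',');
    insertQueue.push({
        createdAt: moment(parseInt(splitted[0])),
        value: parseFloat(splitted[1])
    }, function () {
        // Runs when this row's insert has finished; resume reading once
        // the backlog of queued rows is small again.
        if (insertQueue.length() < 50) {
            lineReader.resume();
        }
    });

    // Too many rows waiting for an insert slot: stop reading for now.
    if (insertQueue.length() > 500) {
        lineReader.pause();
    }
});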
I don't know if the end result will be much quicker, though; it might end up pretty similar.
Upvotes: 2