colster

Reputation: 359

node slow and unresponsive with large data file

I've written a simple node program to parse an excel formatted HTML table returned from a corporate ERP, pull out the data, and save it as JSON.

This uses FS to open the file and Cheerio to extract the data.

The program works fine for small files (<10MB) but takes many minutes for large files (>30MB).

The data file I'm having trouble with is 38MB and has about 30,000 rows of data.

Question 1: Shouldn't this be faster? Question 2: I can only get one console.log statement to output. I can put one statement anywhere and it works, but if I add more than one, only the first one outputs anything.

var fs = require('fs');                             // for file system streaming

function oracleParse(file, callback) {

    var headers = [];                               // array to store the data table column headers
    var myError;                                    // module error holder
    var XMLdata = [];                               // array to store the parsed XML data to be returned
    var cheerio = require('cheerio');

    // open relevant file
    var reader = fs.readFile(file, function (err, data) {

        if (err) {

            myError = err;                                      // catch errors returned from file open
        } else {
            $ = cheerio.load(data);                             // load data returned from fs into cheerio for parsing

            // the data returned from Oracle consists of a variable number of tables; however, the last one is
            // always the one that contains the data.  We can select this with cheerio and reset the cheerio $ object
            var dataTable = $('table').last();
            $ = cheerio.load(dataTable);

            // table column headers in the table of data returned from Oracle include headers under 'tr td b' elements
            // We extract these headers and load these into the 'headers' array for future use as keys in the JSON
            // data array to be constructed
            $('tr td b').each(function (i, elem) {
                headers.push($(this).text());
            });

            // remove the headers from the cheerio data object so that they don't interfere with the data
            $('tr td b').remove();

            // for the actual data, each row of data (this corresponds to a customer, account, transaction record etc.) is
            // extracted using cheerio and stored in a key/value object.  These objects are then stored in an array
            var dataElements = [];
            var dataObj = {};
            var headersLength = headers.length;
            var headerNum;
            // the actual data is returned from Oracle in 'tr td nobr' elements.  Using cheerio, we can extract all of
            // these elements although they are not separated into individual rows.  It is possible to return individual
            // rows using cheerio (e.g. 'tr') but this is very slow as cheerio needs to requery each subsequent row.
            // In our case, we simply select all data elements using the 'tr td nobr' selector and then iterate through
            // them, aligning them with the relevant key and grouping them into relevant rows by taking the modulus of
            // the element number returned and the number of headers there are.
            $('tr td nobr').each(function (i, elem) {

                headerNum = i % headersLength;                   // pick which column is associated with each element

                dataObj[headers[headerNum]] = $(this).text();    // build the row object

                // if we find the header number is equal to the header length less one, we have reached the end of
                // elements for the row and push the row object onto the array in which we store the final result
                if (headerNum === headersLength - 1) {
                    XMLdata.push(dataObj);
                    dataObj = {};
                }
            });
            console.log(headersLength);

            // once all the data in the file has been parsed, run the call back function passed in
            callback(JSON.stringify(XMLdata));
        }
    });

    return myError;
}

// parse promo dates data
var file = './data/Oracle/signups_01.html';
var output = './data/Oracle/signups_01.JSON';
//var file = './data/Oracle/detailed_data.html';
//var output = './data/Oracle/detailed_data.JSON';
var test = oracleParse(file, function(data) {
    fs.writeFile(output, data, function(err) {
        if (err) throw err;
        console.log('File write complete: ' + output);
    });
});

console.log(test);
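The modulus-based row grouping described in the code comments can be sketched in isolation. The headers and cell values below are made-up placeholders, standing in for the flat list of 'tr td nobr' text values that cheerio returns in document order:

```javascript
// Hypothetical flat cell list, as returned in document order by $('tr td nobr')
var headers = ['Customer', 'Account', 'Amount'];
var cells = ['Acme', 'A-100', '25.00', 'Globex', 'G-200', '99.50'];

var rows = [];
var rowObj = {};
cells.forEach(function (text, i) {
    var headerNum = i % headers.length;      // which column this cell belongs to
    rowObj[headers[headerNum]] = text;       // build up the current row object
    if (headerNum === headers.length - 1) {  // last column reached: row complete
        rows.push(rowObj);
        rowObj = {};
    }
});

console.log(JSON.stringify(rows));
// prints [{"Customer":"Acme","Account":"A-100","Amount":"25.00"},
//         {"Customer":"Globex","Account":"G-200","Amount":"99.50"}]
```

This assumes the cell count is an exact multiple of the header count; a trailing partial row would be silently dropped, which is also true of the original code.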

Upvotes: 3

Views: 2532

Answers (1)

fb55

Reputation: 1227

You might want to check out a streaming solution like substack's trumpet or (shameless self-plug) cornet. Otherwise, you're traversing the document multiple times, which will always take some time.

My guess is that Chrome defers heavy lifting intelligently - you probably only care about the first couple of rows, so that's what you get. Try including jQuery and running your code; it will still take some time. To be fair, Chrome's DOM isn't garbage collected and therefore will always outperform cheerio.
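To illustrate the single-pass idea (this is not trumpet's or cornet's actual API, just a sketch): assuming, hypothetically, that every header sits in a `<b>` tag and every data cell in a `<nobr>` tag as in the question, one linear scan with regexes avoids re-traversing the document. Regexes are fragile on real-world HTML, so treat this only as a demonstration of the one-pass shape a streaming parser gives you:

```javascript
// Tiny made-up sample of the Oracle-style table described in the question
var html = '<table><tr><td><b>Name</b></td><td><b>Qty</b></td></tr>' +
           '<tr><td><nobr>Widget</nobr></td><td><nobr>3</nobr></td></tr></table>';

// Pass 1 over the raw string: collect headers from <b> tags
var headers = [];
var reB = /<b>([^<]*)<\/b>/g;
var m;
while ((m = reB.exec(html)) !== null) headers.push(m[1]);

// Pass 2 over the raw string: collect cells from <nobr> tags and group
// them into row objects with the same modulus trick as the original code
var rows = [];
var row = {};
var reN = /<nobr>([^<]*)<\/nobr>/g;
var i = 0;
while ((m = reN.exec(html)) !== null) {
    row[headers[i % headers.length]] = m[1];
    if (i % headers.length === headers.length - 1) {
        rows.push(row);
        row = {};
    }
    i++;
}

console.log(JSON.stringify(rows)); // prints [{"Name":"Widget","Qty":"3"}]
```

A real streaming parser would additionally process the file chunk by chunk instead of holding the whole 38MB string in memory.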

Upvotes: 2
