Reputation: 359
I've written a simple Node program to parse an Excel-formatted HTML table returned from a corporate ERP, pull out the data, and save it as JSON.
It uses fs to open the file and Cheerio to extract the data.
The program works fine for small files (<10MB) but takes many minutes for large files (>30MB).
The data file I'm having trouble with is 38MB and has about 30,000 rows of data.
Question 1: Shouldn't this be faster? Question 2: I can only get one console.log statement to output. I can put one statement anywhere and it works, but if I add more than one, only the first one outputs anything.
var fs = require('fs'); // for file system streaming

function oracleParse(file, callback) {
    var headers = []; // array to store the data table column headers
    var myError;      // module error holder
    var XMLdata = []; // array to store the parsed XML data to be returned
    var cheerio = require('cheerio');

    // open the relevant file
    var reader = fs.readFile(file, function (err, data) {
        if (err) {
            myError = err; // catch errors returned from the file open
        } else {
            var $ = cheerio.load(data); // load the data returned from fs into cheerio for parsing
            // the data returned from Oracle consists of a variable number of tables; however, the last
            // one always contains the data. We can select this with cheerio and reset the cheerio $ object
            var dataTable = $('table').last();
            $ = cheerio.load(dataTable);
            // column headers in the table of data returned from Oracle appear under 'tr td b' elements.
            // We extract these headers and load them into the 'headers' array for future use as keys in
            // the JSON data array to be constructed
            $('tr td b').each(function (i, elem) {
                headers.push($(this).text());
            });
            // remove the headers from the cheerio data object so that they don't interfere with the data
            $('tr td b').remove();
            // for the actual data, each row of data (this corresponds to a customer, account, transaction
            // record, etc.) is extracted using cheerio and stored in a key/value object. These objects are
            // then stored in an array
            var dataElements = [];
            var dataObj = {};
            var headersLength = headers.length;
            var headerNum;
            // the actual data is returned from Oracle in 'tr td nobr' elements. Using cheerio, we can
            // extract all of these elements, although they are not separated into individual rows. It is
            // possible to return individual rows using cheerio (e.g. 'tr'), but this is very slow as
            // cheerio needs to requery each subsequent row. In our case, we simply select all data
            // elements using the 'tr td nobr' selector and then iterate through them, aligning them with
            // the relevant key and grouping them into relevant rows by taking the modulus of the element
            // number returned and the number of headers there are
            $('tr td nobr').each(function (i, elem) {
                headerNum = i % headersLength; // pick which column is associated with each element
                dataObj[headers[headerNum]] = $(this).text(); // build the row object
                // if we find the header number is equal to the header length less one, we have reached
                // the end of elements for the row and push the row object onto the array in which we
                // store the final result
                if (headerNum === headersLength - 1) {
                    XMLdata.push(dataObj);
                    dataObj = {};
                }
            });
            console.log(headersLength);
            // once all the data in the file has been parsed, run the callback function passed in
            callback(JSON.stringify(XMLdata));
        }
    });
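    // note: fs.readFile is asynchronous, so execution reaches the return below before the
    // callback above has run; myError is therefore still undefined when it is returned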
    return myError;
}
// parse promo dates data
var file = './data/Oracle/signups_01.html';
var output = './data/Oracle/signups_01.JSON';
//var file = './data/Oracle/detailed_data.html';
//var output = './data/Oracle/detailed_data.JSON';
var test = oracleParse(file, function(data) {
    fs.writeFile(output, data, function(err) {
        if (err) throw err;
        console.log('File write complete: ' + output);
    });
});

console.log(test);
Upvotes: 3
Views: 2532
Reputation: 1227
You might want to check out a streaming solution like substack's trumpet or (shameless self-plug) cornet. Otherwise, you're traversing the document multiple times, which will always take some time.
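To illustrate the single-pass idea, here is a minimal sketch using htmlparser2 (the parser cheerio wraps) rather than trumpet or cornet. It streams the file through the parser in chunks and assumes, per the question's markup, that headers appear in b elements and data cells in nobr elements, and that no earlier table in the file uses those tags; streamParse and its callback shape are illustrative, not a drop-in replacement.
var fs = require('fs');
var htmlparser = require('htmlparser2'); // the same parser cheerio uses internally

function streamParse(file, callback) {
    var headers = [];   // column names, taken from the b elements
    var rows = [];      // finished row objects
    var row = {};
    var col = 0;
    var buf = '';       // text accumulator (ontext can fire more than once per node)
    var capture = null; // 'header' inside a b element, 'cell' inside a nobr element

    var parser = new htmlparser.Parser({
        onopentag: function (name) {
            if (name === 'b') capture = 'header';
            else if (name === 'nobr') capture = 'cell';
        },
        ontext: function (text) {
            if (capture) buf += text;
        },
        onclosetag: function (name) {
            if (name === 'b' && capture === 'header') {
                headers.push(buf);
            } else if (name === 'nobr' && capture === 'cell') {
                row[headers[col]] = buf; // align the cell with its header
                col += 1;
                if (col === headers.length) { // last column reached: the row is complete
                    rows.push(row);
                    row = {};
                    col = 0;
                }
            }
            if (name === 'b' || name === 'nobr') {
                capture = null;
                buf = '';
            }
        },
        onend: function () {
            callback(JSON.stringify(rows));
        }
    });

    // feed the file through the parser chunk by chunk; the whole 38MB is never held in memory
    fs.createReadStream(file, { encoding: 'utf8' })
        .on('data', function (chunk) { parser.write(chunk); })
        .on('end', function () { parser.end(); });
}
Because every tag is visited exactly once and nothing is re-queried, memory use and run time grow roughly linearly with file size.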
My guess is that Chrome defers heavy lifting intelligently: you probably only care about the first couple of rows, so that's what you get. Try including jQuery and running your code; it will still take some time. To be fair, Chrome's DOM isn't garbage collected and therefore will always outperform cheerio.
Upvotes: 2