Reputation: 305
I am using the NPM 'xlsx' library to process large XLSX files (10-20MB) in Node.js / Javascript. This has slow performance because it reads the entire XLSX file before operating, rather than only reading what it needs. In this case, I only need to read a single worksheet with the title 'target-sheet' and the rest of the file can be ignored.
Is there a way to delete the sheets other than 'target-sheet' without reading the rest of the file? Overall, I am looking for a way to ignore the other worksheets at the outset, before their contents are read, in order to improve efficiency.
Thanks so much
Upvotes: 1
Views: 1367
Reputation: 3687
If you are sensitive about performance and memory usage. Even if you tell the xlsx
module to process only the required sheet, you will still have a memory problem. Although there is no charge for the processing of other sheets, I am almost sure that for the first time it is necessary to read the file completely from the disk and keep it in memory.
In this case, it is true that no cost is given for processing other sheets, but the file is still read from the disk once.
Hence, I implemented a test to prove this case, I created a file of approximately 20 megabytes containing 5 sheets, 4 of which have 50,000 records each, and the other sheet has only one record.
Using memoryUsage, I measured the amount of memory during processing, and the result is summarized below:
Without passing sheet:
import XLSX from 'xlsx';
// Convert memory usage to MB
const formatMemoryUsage = (data) => `${Math.round(data / 1024 / 1024 * 100) / 100} MB`;
// Generate human readable output
const generateMemoryUsageOutput = (memoryData) => ({
rss: `${formatMemoryUsage(memoryData.rss)} -> Resident Set Size - total memory allocated for the process execution`,
heapTotal: `${formatMemoryUsage(memoryData.heapTotal)} -> total size of the allocated heap`,
heapUsed: `${formatMemoryUsage(memoryData.heapUsed)} -> actual memory used during the execution`,
external: `${formatMemoryUsage(memoryData.external)} -> V8 external memory`,
});
console.log('before', generateMemoryUsageOutput(process.memoryUsage()));
XLSX.readFile('SampleXLSFile_6800kb.xls');
console.log('after', generateMemoryUsageOutput(process.memoryUsage()));
/* ------------------------- result --------------------------------- */
/*
before {
rss: '67.77 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '27.64 MB -> total size of the allocated heap',
heapUsed: '21.12 MB -> actual memory used during the execution',
external: '6.44 MB -> V8 external memory'
}
after {
rss: '917.89 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '819.21 MB -> total size of the allocated heap',
heapUsed: '790.39 MB -> actual memory used during the execution',
external: '60.29 MB -> V8 external memory'
}
*/
With passing sheet:
import XLSX from 'xlsx';
// Convert memory usage to MB
const formatMemoryUsage = (data) => `${Math.round(data / 1024 / 1024 * 100) / 100} MB`;
// Generate human readable output
const generateMemoryUsageOutput = (memoryData) => ({
rss: `${formatMemoryUsage(memoryData.rss)} -> Resident Set Size - total memory allocated for the process execution`,
heapTotal: `${formatMemoryUsage(memoryData.heapTotal)} -> total size of the allocated heap`,
heapUsed: `${formatMemoryUsage(memoryData.heapUsed)} -> actual memory used during the execution`,
external: `${formatMemoryUsage(memoryData.external)} -> V8 external memory`,
});
console.log('before', generateMemoryUsageOutput(process.memoryUsage()));
XLSX.readFile('SampleXLSFile_6800kb.xls', { sheets: [ 'Sheet5' ] });
console.log('after', generateMemoryUsageOutput(process.memoryUsage()));
/* ------------------------- result --------------------------------- */
/*
before {
rss: '68.04 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '28.39 MB -> total size of the allocated heap',
heapUsed: '21.2 MB -> actual memory used during the execution',
external: '6.44 MB -> V8 external memory'
}
after {
rss: '917.44 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '819.46 MB -> total size of the allocated heap',
heapUsed: '790.72 MB -> actual memory used during the execution',
external: '49.27 MB -> V8 external memory'
}
*/
As you can see, the amount of memory used is almost the same for both.
But in my opinion, the best solution for such issues is to use the NodeJS child_process.
In this way, the main process is not busy and the responsibility for its processing is given to the child.
Hence, I used the fork method. According to the Node.js documentation:
Keep in mind that spawned Node.js child processes are independent of the parent with exception of the IPC communication channel that is established between the two. Each process has its own memory, with their own V8 instances.
using fork
method:
import { fork } from 'child_process';
// Convert memory usage to MB
const formatMemoryUsage = (data) => `${Math.round(data / 1024 / 1024 * 100) / 100} MB`;
// Generate human readable output
const generateMemoryUsageOutput = (memoryData) => ({
rss: `${formatMemoryUsage(memoryData.rss)} -> Resident Set Size - total memory allocated for the process execution`,
heapTotal: `${formatMemoryUsage(memoryData.heapTotal)} -> total size of the allocated heap`,
heapUsed: `${formatMemoryUsage(memoryData.heapUsed)} -> actual memory used during the execution`,
external: `${formatMemoryUsage(memoryData.external)} -> V8 external memory`,
});
console.log('before', generateMemoryUsageOutput(process.memoryUsage()));
const child = fork('child.js');
child.on("close", function (code) {
console.log('after', generateMemoryUsageOutput(process.memoryUsage()));
});
/* ------------------------- result --------------------------------- */
/*
before {
rss: '33.84 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '4.67 MB -> total size of the allocated heap',
heapUsed: '3.96 MB -> actual memory used during the execution',
external: '0.37 MB -> V8 external memory'
}
before (child) {
rss: '67.41 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '27.89 MB -> total size of the allocated heap',
heapUsed: '21.1 MB -> actual memory used during the execution',
external: '6.53 MB -> V8 external memory'
}
after (child) {
rss: '918.89 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '819.4 MB -> total size of the allocated heap',
heapUsed: '784.71 MB -> actual memory used during the execution',
external: '59.71 MB -> V8 external memory'
}
after {
rss: '34.88 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '5.18 MB -> total size of the allocated heap',
heapUsed: '4.55 MB -> actual memory used during the execution',
external: '0.39 MB -> V8 external memory'
}
*/
child.js:
import XLSX from 'xlsx';
// Convert memory usage to MB
const formatMemoryUsage = (data) => `${Math.round(data / 1024 / 1024 * 100) / 100} MB`;
// Generate human readable output
const generateMemoryUsageOutput = (memoryData) => ({
rss: `${formatMemoryUsage(memoryData.rss)} -> Resident Set Size - total memory allocated for the process execution`,
heapTotal: `${formatMemoryUsage(memoryData.heapTotal)} -> total size of the allocated heap`,
heapUsed: `${formatMemoryUsage(memoryData.heapUsed)} -> actual memory used during the execution`,
external: `${formatMemoryUsage(memoryData.external)} -> V8 external memory`,
});
console.log('before (child)', generateMemoryUsageOutput(process.memoryUsage()));
XLSX.readFile('SampleXLSFile_6800kb.xls', { sheets: [ 'Sheet5' ] });
console.log('after (child)', generateMemoryUsageOutput(process.memoryUsage()));
Upvotes: 2
Reputation: 11
Assuming you are talking about this package and using XLSX.read
or XLSX.readFile
, then you want to pass the sheets
option (docs here). So, something like:
XLSX.readFile(filename, { sheets: [ 'target-sheet' ] })
Upvotes: 1