Reputation: 2229
I need to read a large zip file in Node.js and process each file (approx. 100 MB zip file containing ca. 40,000 XML files, 500 KB each uncompressed). I am looking for a 'streaming' solution that has acceptable speed and does not require keeping the whole dataset in memory (JSZip and node-zip worked for me, but they keep everything in RAM and the performance is not good enough). A quick attempt in C# shows that loading, unpacking and parsing the XML can be done in approx. 9 seconds on a 2-year-old laptop (using DotNetZip). I don't expect Node.js to be as fast, but anything under one minute would be okay. Unpacking the file to local disk and then processing it is not an option.
I am currently attempting to use the unzip module (https://www.npmjs.org/package/unzip) but can't get it to work, so I don't know whether the speed is acceptable, but at least it looks like I can stream each file and process it in a callback. (The problem is that I only receive the first 2 entries; after that it stops calling the .on('entry', callback) callback. I don't get any error, it just silently stops after 2 files. It would also be good to know how I can get the full XML in one chunk instead of collecting it buffer by buffer.)
var fs = require('fs');
var unzip = require('unzip');

function openArchive() {
    fs.createReadStream('../../testdata/small2.zip')
        .pipe(unzip.Parse())
        .on('entry', function (entry) {
            var fileName = entry.path;
            var type = entry.type; // 'Directory' or 'File'
            var size = entry.size;
            console.log(fileName);
            entry.on('data', function (data) {
                console.log("received data");
            });
        });
}
There are plenty of Node.js modules for working with zip files, so this question is really about figuring out which library is best suited for this scenario.
Upvotes: 2
Views: 5478
Reputation: 23409
Solution for late 2024:
The unzip package is dated and has deprecated dependencies. unzip-stream is a still-maintained alternative with up-to-date dependencies.
My solution returns a Promise that resolves only when all the files are extracted, and takes an optional callback that is called after each individual file is extracted, for indicating progress.
It also streams the data to the destination file instead of holding each file in memory and then writing it, as MeatZebre's answer does, which was a concern for me since I was unzipping large video files.
import unzip from 'unzip-stream';
import fs from "fs";
import path from "path";

let src = '/Volumes/Crucial X6/iCloud-photos-2024/iCloud Photos.zip';
let dest = '/Users/adelphia/test';

let extracted = await processArchive(src, dest, fn => {
    console.log(`Extracted ${path.basename(fn)}`);
});
console.log('done', extracted);

export function processArchive(src, dest, onEach) {
    return new Promise(resolve => {
        let promises = [];
        fs.createReadStream(src)
            .pipe(unzip.Parse())
            .on('finish', async () => {
                // wait until every entry has finished writing to disk
                let extracted = await Promise.all(promises);
                resolve(extracted);
            })
            .on('entry', function (entry) {
                promises.push(new Promise(entryComplete => {
                    let filename = path.basename(entry.path);
                    let dest_path = path.join(dest, filename);
                    // stream the entry straight to disk instead of buffering it in memory
                    entry.pipe(fs.createWriteStream(dest_path)).on('finish', () => {
                        entry.autodrain();
                        if (onEach) onEach(dest_path);
                        entryComplete(dest_path);
                    });
                }));
            });
    });
}
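If you don't need per-file progress and just want the archive unpacked into a directory, unzip-stream also provides an Extract stream that does the writing for you. A minimal sketch, assuming the Extract({ path }) option and the 'close' event behave as in the original unzip module (check the unzip-stream docs for the version you install):

import unzip from 'unzip-stream';
import fs from 'fs';

// Unpack every entry of the archive into destDir and resolve when extraction is done.
function extractAll(src, destDir) {
    return new Promise((resolve, reject) => {
        fs.createReadStream(src)
            .pipe(unzip.Extract({ path: destDir }))
            .on('close', resolve)
            .on('error', reject);
    });
}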
Upvotes: 1
Reputation: 939
I've had the same task to do: process 100+ MB zip archives with 100,000+ XML files in each of them. In that case, unzipping the files to disk is just a waste of HD space. I tried adm-zip, but it would load and expand the whole archive in RAM, and my script would break at around 1,400 MB of RAM usage.
Using the code from the question and the nice tip from Dilan's answer, I was sometimes only getting partial XML content, which would of course break my XML parser.
After some trials, I ended up with this code:
const fs = require('fs');
const unzip = require('unzip');

// process one .zip archive
function process_archive(filename) {
    fs.createReadStream(filename)
        .pipe(unzip.Parse())
        .on('entry', function (entry) {
            // entry.path is the file name
            // entry.type is 'Directory' or 'File'
            // entry.size is the size of the file
            const chunks = [];
            entry.on('data', (data) => chunks.push(data));
            entry.on('error', (err) => console.log(err));
            entry.on('end', () => {
                // reassemble the full file content from the collected chunks
                let content = Buffer.concat(chunks).toString('utf8');
                process_my_file(entry.path, content);
                entry.autodrain();
            });
        });
}
If this can help anybody: it's quite fast and worked well for me, using a maximum of only 25 MB of RAM.
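The process_my_file call above is the answerer's own hook and is not defined in the answer. A hypothetical implementation that parses each extracted document with xml2js (my assumption; the answer does not name a parser) could look like this:

const { parseString } = require('xml2js');

// Hypothetical handler: parse one XML document that was read from the archive.
function process_my_file(filename, content) {
    parseString(content, (err, result) => {
        if (err) {
            console.error('Failed to parse ' + filename, err);
            return;
        }
        // work with the parsed JavaScript object here
    });
}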
Upvotes: 4
Reputation: 90
You have to call .autodrain() or pipe the data to another stream:
entry.on('data', function (data) {
    entry.autodrain();
    // or entry.pipe(require('fs').createWriteStream(entry.path))
});
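In other words, every entry emitted by unzip.Parse() must be consumed, either by piping it somewhere or by calling .autodrain(); otherwise the parser stalls, which is consistent with the "silently stops after 2 files" symptom described in the question. A minimal sketch of that pattern (the archive name and the .xml filter are just for illustration):

var fs = require('fs');
var unzip = require('unzip');

fs.createReadStream('archive.zip')
    .pipe(unzip.Parse())
    .on('entry', function (entry) {
        if (entry.type === 'File' && /\.xml$/.test(entry.path)) {
            // consume the entries we care about by piping them to a writable stream
            entry.pipe(fs.createWriteStream('out_' + entry.path.replace(/\//g, '_')));
        } else {
            // discard everything else so the parser can continue to the next entry
            entry.autodrain();
        }
    });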
Upvotes: 2