Zag Gol

Reputation: 1076

Nodejs - removing substring from a huge file

I need to remove a substring (which appears only in specific, known lines) from a file.

There are simple solutions that read the whole file into a string, remove the substring, and then write the fixed data back to the file.

Here is code I found here:

Node js - Remove string from text file

const fs = require('fs');

var data = fs.readFileSync('banlist.txt', 'utf-8');
var newValue = data.replace(new RegExp("STRING_TO_REMOVE"), '');
fs.writeFileSync('banlist.txt', newValue, 'utf-8');

My problem is that the file is huge - up to a billion lines of logs - so I can't read all of its content into memory.

Upvotes: 1

Views: 1508

Answers (5)

Lukas C

Reputation: 453

Why not a simple transform stream and replace()? replace() can also take a callback as its second argument, e.g. .replace(/bad1|bad2|bad3/g, filterWords), in case you need to replace words rather than remove them completely.

const fs = require("fs")
const { pipeline, Transform } = require("stream")
const { join } = require("path")

const readFile = fs.createReadStream("./words.txt")
const writeFile = fs.createWriteStream(
  join(__dirname, "words-filtered.txt"),
  "utf8"
)

const transformFile = new Transform({
  transform(chunk, enc, next) {
    // Replace every occurrence of "bad" in this chunk before passing it on
    let c = chunk.toString().replace(/bad/g, "replaced")
    this.push(c)
    next()
  },
})

pipeline(readFile, transformFile, writeFile, (err) => {
  if (err) {
    console.log(err.message)
  }
})
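
The filterWords callback isn't defined above; here is a minimal sketch of what the callback form could look like (the replacement map is just an illustration, not part of the original answer):

const replacements = { bad1: "good1", bad2: "good2", bad3: "good3" }
const filterWords = (match) => replacements[match] || ""

// Each match is passed to the callback, which returns its replacement
"a bad1 and bad3 day".replace(/bad1|bad2|bad3/g, filterWords) // "a good1 and good3 day"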

Upvotes: 3

Ahmed ElMetwally

Reputation: 2383

You can use this code to do it. I'm using an fs stream; it's designed for reading huge files in small chunks without loading everything into memory (docs).

const fs = require('fs');

const readStream = fs.createReadStream('./XXXXX');
const writeStream = fs.createWriteStream('./XXXXXXX');

readStream.on('data', (chunk) => {
  // Replace every occurrence in this chunk (assumes the substring never
  // straddles a chunk boundary)
  const data = chunk.toString().replace(new RegExp('STRING_TO_REMOVE', 'g'), 'XXXXXX');
  writeStream.write(data);
});

readStream.on('end', () => {
  writeStream.end();
});

Upvotes: 1

Matthew Howard

Reputation: 179

What you probably want to do is use streams, so that you write after partial reads. This example should work for you; you just need to copy the ".tmp" output file over the original afterwards to get the behavior described in your question. It works by reading a chunk and looking for a newline; when one is found, it processes that line, writes it, and removes it from the buffer. This should help with your memory problem.

var fs = require("fs");
var readStream = fs.createReadStream("./BFFile.txt", { encoding: "utf-8" });
var writeStream = fs.createWriteStream("./BFFile.txt.tmp");

const STRING_TO_REMOVE = "badword";
var buffer = "";

readStream.on("data", (chunk) => {
    buffer += chunk;
    // Process every complete line currently sitting in the buffer
    var indexOfNewLine = buffer.indexOf("\n");
    while (indexOfNewLine !== -1) {
        var line = buffer.substring(0, indexOfNewLine + 1);
        buffer = buffer.substring(indexOfNewLine + 1, buffer.length);
        line = line.replace(new RegExp(STRING_TO_REMOVE), "");
        writeStream.write(line);
        indexOfNewLine = buffer.indexOf("\n");
    }
})

readStream.on("end", () => {
    // Whatever is left has no trailing newline; clean it and flush it
    buffer = buffer.replace(new RegExp(STRING_TO_REMOVE), "");
    writeStream.write(buffer);
    writeStream.end();
})

There are a few assumptions with this solution, such as the data being UTF-8, there being at most one bad word per line, every line having some text (I didn't test for that), and every line ending with a newline rather than some other line ending.

Here are the docs for streams in Node. Another thought I had was to use pipe and a transform stream, but that seems like overkill.

Upvotes: 1

VeryGoodDog

Reputation: 335

You could use a file read stream. However, you would have to find a way to detect whether the chunk you just read ends in the middle of a match, so that a substring split across two chunks isn't missed.
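
A minimal sketch of that idea, assuming the substring is plain text (the file names and the STRING_TO_REMOVE value are placeholders, not from the original answer): keep the last few characters of each chunk in a carry-over buffer so a match split across two chunks is still removed.

const fs = require("fs");

const STRING_TO_REMOVE = "badword"; // placeholder
const readStream = fs.createReadStream("./input.txt", "utf8");
const writeStream = fs.createWriteStream("./output.txt");

// A full match can only straddle a chunk boundary within the last
// (length - 1) characters, so that tail is carried over to the next chunk.
const keep = STRING_TO_REMOVE.length - 1;
let tail = "";

readStream.on("data", (chunk) => {
  const data = (tail + chunk).split(STRING_TO_REMOVE).join("");
  tail = data.slice(data.length - keep);
  writeStream.write(data.slice(0, data.length - tail.length));
});

readStream.on("end", () => {
  // The leftover tail is shorter than the search string, so it cannot
  // contain a full match; just flush it.
  writeStream.end(tail);
});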

Upvotes: 1

DynasticSponge

Reputation: 1431

https://nodejs.org/api/fs.html#fs_fs_read_fd_buffer_offset_length_position_callback

Don't read the whole file at once... read a small buffered piece of it, look for your input within that piece, then advance your starting position and do it again. I would recommend not starting each buffer exactly where the previous one ended, but overlapping it by at least the expected size of the data being sought, so you don't end up with half of your data at the end of one buffer and the other half at the beginning of the next.
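
A minimal sketch of that overlapping-buffer approach, here just locating the string (the linked docs describe the callback form of fs.read; the synchronous fs.readSync variant is used below for brevity, and the file name, chunk size, and NEEDLE value are placeholders):

const fs = require("fs");

const NEEDLE = "STRING_TO_FIND"; // placeholder
const CHUNK_SIZE = 64 * 1024;
const OVERLAP = NEEDLE.length - 1; // overlap reads so a boundary can't hide a match

const fd = fs.openSync("./huge.log", "r");
const buffer = Buffer.alloc(CHUNK_SIZE);
let position = 0;

while (true) {
  const bytesRead = fs.readSync(fd, buffer, 0, CHUNK_SIZE, position);
  if (bytesRead === 0) break;

  const text = buffer.toString("utf8", 0, bytesRead);
  const index = text.indexOf(NEEDLE);
  if (index !== -1) {
    console.log("found near byte offset", position + index);
    break;
  }

  if (bytesRead < CHUNK_SIZE) break; // end of file reached
  // Start the next read OVERLAP bytes before the end of this one
  position += bytesRead - OVERLAP;
}

fs.closeSync(fd);

Once the offset is known you can rewrite the file around it, or adapt the same loop to stream the cleaned output to a temporary file.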

Upvotes: 1
