Reputation: 7064
I'm experiencing a really weird issue in my Node.js code. The code basically loads a JSON object serialized into a relatively big, but not really enormous, file - ~150 MB. The problem is that really non-deterministic stuff happens when I try to load this file:
lapsio@linux-qzuq /d/g/GreenStorage> node
> k1=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k1.length
157839101
> k2=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k2.leng
FATAL ERROR: invalid array length Allocation failed - process out of memory
fish: “node” terminated by signal SIGABRT (Abort)
Second try:
> k1=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k2=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k1.length
157839101
> k2.length
157839101
> k1==k2
false
Of course the file is already cached in RAM at this point, judging from the response time, so it's not a storage issue. My actual app:
try {
  // Parse the file, re-serialize the parsed object, and compare it
  // byte-by-byte against the original string to detect corruption.
  var ind = JSON.parse(args.legacyconvert ? bfile : content),
      ostr = String(args.legacyconvert ? bfile : content),
      str = JSON.stringify(ind, null, 2);
  for (var i = 0, l = str.length; i < l; i++)
    if (str[i] != ostr[i]) {
      console.error('Soft bug occured - it\'s serious bug and probably classifies as node bug or linux memcache bug. Should be reported');
      throw ('Original string and reparsed don\'t match at ' + i + ' byte - system string conversion malfunction - abtorting');
    }
  return ind;
} catch (e) {
  console.error('Could not read index - aborting', p, e);
  process.exit(11);
}
Results:
lapsio@linux-qzuq /d/g/G/D/c/fsmonitor> sudo ./reload.js -e ../../../../etc/md5index/*.extindex
Reading index... ( ../../../../etc/md5index/green-Documents.extindex )
Soft bug occured - it's serious bug and probably classifies as node bug or linux memcache bug. Should be reported
Could not read index - aborting ../../../../etc/md5index/green-Documents.extindex Original string and reparsed don't match at 116655242 byte - system string conversion malfunction - abtorting
lapsio@linux-qzuq /d/g/G/D/c/fsmonitor> sudo ./reload.js -e ../../../../etc/md5index/*.extindex
Reading index... ( ../../../../etc/md5index/green-Documents.extindex )
Soft bug occured - it's serious bug and probably classifies as node bug or linux memcache bug. Should be reported
Could not read index - aborting ../../../../etc/md5index/green-Documents.extindex Original string and reparsed don't match at 39584906 byte - system string conversion malfunction - abtorting
It reports a random byte mismatch every time. There's also roughly a 50% chance that the file will be corrupted after saving. Sometimes it doesn't even parse properly because it finds some weird non-ASCII character, like [SyntaxError: Unexpected token 䀠]. It's node from the OpenSUSE repo. I've tried it on many machines. The bug is relatively hard to reproduce because it happens quite randomly, but once it appears for the first time, it keeps appearing more or less every time until reboot.
lapsio@linux-qzuq /d/g/GreenStorage> node -v
v0.12.7
The PC has 16 GB of RAM and node doesn't even hit 10% of that, so I'm sure it's not a lack-of-RAM issue. It doesn't seem to be a filesystem-related issue either, because md5sum and other hash generators always return a valid checksum; only node fails. I'm not sure what to think about it. Does it actually classify as a bug?
Upvotes: 3
Views: 1611
Reputation: 108641
Your code shows that you're slurping the big JSON file and then parsing it. That means you'll need room for both the raw file and the resulting parsed object. That may be partially to blame for your unpredictable memory-exhaustion problems.
Most people working with files of the size you mention try to use a streaming, or incremental, parsing method. That way the raw data flows through your program, and doesn't have to all be there at the same time.
You might want to check out this streaming JSON parser: https://github.com/dominictarr/JSONStream. It may allow you to successfully get through this chunk of data.
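For instance, here is a minimal sketch of reading the index with JSONStream. The 'entries.*' selector is an assumption about your file's layout, not something I know about your data; adjust it to match the actual structure of your index.

var fs = require('fs');
var JSONStream = require('JSONStream');

// Stream-parse the big index instead of slurping it all at once.
fs.createReadStream('../etc/md5index/green-Documents.extindex')
  .pipe(JSONStream.parse('entries.*'))   // hypothetical pattern - adapt to your JSON shape
  .on('data', function (entry) {
    // Each matched record arrives here one at a time.
    console.log(entry);
  })
  .on('error', function (err) {
    console.error('Streaming parse failed', err);
    process.exit(11);
  });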
A second possibility is to (ab-)use the second parameter to JSON.parse(). Called the reviver, it's a function that gets called with each value found in the JSON text. You could respond to each call by writing the object out somewhere (a file, or maybe a DBMS) and then returning undefined, which drops that property from the result, so JSON.parse won't need to keep every object it encounters. You'll have to experiment with this to get it to work correctly. With this tactic you'll still slurp the big input file, but you'll stream the output.
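A minimal sketch of that tactic, assuming the index is a flat top-level object of entries; writeEntry() is a hypothetical helper that persists one entry to a file or a DBMS:

var fs = require('fs');

var raw = fs.readFileSync('../etc/md5index/green-Documents.extindex', 'utf8');

var skeleton = JSON.parse(raw, function (key, value) {
  // key === '' is the root call; keep it so the parse can finish.
  if (key !== '' && typeof value === 'object' && value !== null) {
    writeEntry(key, value);  // hypothetical: persist the entry elsewhere
    return undefined;        // dropping it keeps the in-memory result small
  }
  return value;
});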
Another possibility is to do your best to split your single JSON document into a sequence of smaller documents, one per record. (It seems likely that a dataset of that size can rationally be split up that way.)
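One common shape for that is newline-delimited JSON: one record per line, read back with the built-in readline module so only one record is in memory at a time. The file name below is illustrative.

var fs = require('fs');
var readline = require('readline');

var rl = readline.createInterface({
  input: fs.createReadStream('green-Documents.ndjson')
});

rl.on('line', function (line) {
  if (!line.trim()) return;       // skip blank lines
  var record = JSON.parse(line);  // each line is a small, independent JSON document
  // ...process one record here...
});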
Upvotes: 2
Reputation: 1682
I would highly suspect this is due to the file size. It sounds like a loading issue.
See this post: Max recommended size of external JSON object in JavaScript
I would recommend looking into SQL instead of JSON; it is much better equipped for managing data sets of that size.
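As a rough sketch of what that could look like - the choice of the sqlite3 npm module and the path/md5 schema here are my assumptions, not anything prescribed by your data:

var sqlite3 = require('sqlite3');
var db = new sqlite3.Database('md5index.sqlite');

db.serialize(function () {
  // A simple path -> md5 table instead of one giant JSON blob.
  db.run('CREATE TABLE IF NOT EXISTS entries (path TEXT PRIMARY KEY, md5 TEXT)');
  db.run('INSERT OR REPLACE INTO entries (path, md5) VALUES (?, ?)',
         ['Documents/report.pdf', 'd41d8cd98f00b204e9800998ecf8427e']);
  // Rows stream back one at a time; the whole data set never sits in memory.
  db.each('SELECT path, md5 FROM entries', function (err, row) {
    if (err) throw err;
    console.log(row.path, row.md5);
  });
});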
Upvotes: 1