Ryan Lester

Reputation: 2403

JSON parser that can handle large input (2 GB)?

So far, I've tried several parsers without success.

I'm going to try yajl next, but Json.NET handles this without any issues, so I'm not sure why it should be such a big problem in C++.

Upvotes: 1

Views: 5241

Answers (5)

Hyndrix

Reputation: 4452

I just faced the same problem with Qt 5.12's JSON support. Fortunately, starting with Qt 5.15 (64-bit), reading large JSON files works flawlessly (I tested with 1 GB files).
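
For reference, a minimal sketch of the whole-file approach this enables; the file name and error handling here are illustrative, not from the original answer:

#include <QFile>
#include <QJsonDocument>
#include <QJsonObject>

int main()
{
    QFile file("huge.json"); // hypothetical file name

    if (!file.open(QIODevice::ReadOnly))
        return 1;

    // Reads the entire document into memory at once; 64-bit Qt 5.15+
    // handles large inputs this way (tested above with ~1 GB files).
    QJsonParseError error;
    QJsonDocument doc = QJsonDocument::fromJson(file.readAll(), &error);

    if (error.error != QJsonParseError::NoError)
        return 1;

    QJsonObject root = doc.object();
    // ... use root ...
    return 0;
}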

Upvotes: 0

luxigo

Reputation: 113

When dealing with records, you can, for example, format your JSON so that a newline separates objects, then parse each line separately, e.g.:

"records": [
{ "someprop": "value", "someobj": { .....   } ... },
.
.
.

or:

"myobj": {
"someprop": { "someobj": {}, ... },
.
.
.
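
A minimal sketch of that line-by-line approach, assuming Qt and a hypothetical records.ndjson file containing one JSON object per line:

#include <QFile>
#include <QJsonDocument>
#include <QJsonObject>
#include <QTextStream>

int main()
{
    QFile file("records.ndjson"); // hypothetical file name

    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return 1;

    QTextStream in(&file);
    while (!in.atEnd())
    {
        // Each line is a complete JSON document, so memory use stays
        // proportional to the largest single record, not the whole file.
        QJsonDocument doc = QJsonDocument::fromJson(in.readLine().toUtf8());
        QJsonObject record = doc.object();
        // ... process one record ...
    }
    return 0;
}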

Upvotes: 0

matiu

Reputation: 7725

I've recently finished (probably still a bit beta) such a library:

https://github.com/matiu2/json--11

If you use the json_class, it'll load everything into memory, which is probably not what you want.

But you can parse it sequentially by writing your own 'mapper'.

The included mapper iterates through the JSON, mapping the input to JSON classes:

https://github.com/matiu2/json--11/blob/master/src/mapper.hpp

You could write your own that does whatever you want with the data, and feed a file stream into it, so as not to load the whole lot into memory.

As an example to get you started, the following just prints the JSON data in an arbitrary format without filling up memory (completely untested and uncompiled):

#include "parser.hpp"
#include <fstream>
#include <iterator>
#include <string>

int main(int argc, char **) {

  std::ifstream file("hugeJSONFile.hpp");
  std::istream_iterator<char> input(file);

  auto parser = json::Parser(input);
  using Parser = decltype(parser);
  using std::cout;
  using std::endl;

  switch (parser.getNextType()) {
  case Parser::null:
    parser.readNull();
    cout << "NULL" << endl;
    return;
  case Parser::boolean:
    bool val = parser.readBoolean();
    cout << "Bool: " << val << endl;
  case Parser::array:
    parser.consumeOneValue();
    cout << "Array: ..." << endl;
  case Parser::object:
    parser.consumeOneValue();
    cout << "Map: ..." << endl;
  case Parser::number: {
    double val = parser.readNumber<double>();
    cout << "number: " << val << endl;
  }
  case Parser::string: {
    std::string val = parser.readString();
    cout << "string: " << val << endl;
  }
  case Parser::HIT_END:
  case Parser::ERROR:
  default:
    // Should never get here
    throw std::logic_error("Unexpected error while parsing JSON");
  }
  return 0;
}

Addendum

Originally I had planned for this library never to copy any data; e.g., reading a string just gave you start and end iterators into the string data in the input. But because we actually need to decode the strings, I found that approach too impractical.

This library automatically converts \uXXXX escape codes in JSON to UTF-8 encoding in standard strings.
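
For the curious, the gist of that decoding step looks something like the following (an illustrative sketch of code-point-to-UTF-8 encoding, not the library's actual code):

#include <string>

// Encode a single Unicode code point (e.g. parsed from a \uXXXX escape)
// as UTF-8 bytes, following the standard 1- to 4-byte scheme.
std::string codepointToUtf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {                       // 1 byte: plain ASCII
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {             // 3 bytes
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 4 bytes, up to U+10FFFF
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}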

Upvotes: 0

Ryan Lester

Reputation: 2403

Well, I'm not proud of my solution, but I ended up using a regex to split my data into top-level key-value pairs (each only a few MB), then parsing each pair with Qt's JSON parser and passing it into my original code.

Yajl would have been exactly what I needed for something like this, but I went with the ugly regex hack because:

  1. Fitting my logic into yajl's callback structure (see the sketch after this list) would have meant rewriting enough of my code to be a pain, and this is just for a one-off MapReduce job, so the code itself doesn't matter long-term anyway.

  2. The data set is controlled by me and guaranteed to always work with my regex.

  3. For various reasons, adding dependencies to Elastic MapReduce deployments is a bigger hassle than it should be (and static Qt compilation is buggy), so for the sake of not doing more work than necessary I'm inclined to keep dependencies to a minimum.

  4. This still works and performs well (both time-wise and memory-wise).
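
For context on reason #1, here's a rough sketch of the callback shape yajl expects (my simplified illustration, not code from this job; the file name is hypothetical). Every value the parser encounters arrives as a separate callback, which is the restructuring being avoided:

#include <yajl/yajl_parse.h>
#include <cstdio>

// Callbacks return non-zero to continue parsing, zero to abort.
static int on_map_key(void *ctx, const unsigned char *key, size_t len) {
    std::fwrite(key, 1, len, stdout);
    std::fputc('\n', stdout);
    return 1;
}

int main() {
    yajl_callbacks callbacks = {}; // unset (null) callbacks are simply skipped
    callbacks.yajl_map_key = on_map_key;

    yajl_handle hand = yajl_alloc(&callbacks, nullptr, nullptr);

    std::FILE *file = std::fopen("huge.json", "rb");
    unsigned char buffer[65536];
    size_t bytesRead;

    // Feed the file in chunks, so memory use is bounded by the buffer
    // size rather than the document size (error handling omitted).
    while ((bytesRead = std::fread(buffer, 1, sizeof buffer, file)) > 0)
        yajl_parse(hand, buffer, bytesRead);
    yajl_complete_parse(hand);

    yajl_free(hand);
    std::fclose(file);
    return 0;
}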

Note that the regex I used happens to work for my data specifically because the top-level keys (and only the top-level keys) are integers; my code below is not a general solution, and I wouldn't ever advise a similar approach over a SAX-style parser when reasons #1 and #2 above don't apply.

Also note that this solution is extra gross (splitting and manipulating JSON strings before parsing, plus special cases for the start and end of the data) because my original expression, which captured entire key-value pairs, broke down when one of the pairs exceeded PCRE's backtracking limit. It's incredibly annoying that that's even a thing here, especially since the limit isn't configurable through either QRegularExpression or grep.


Anyway, here's the code; I am deeply ashamed:

#include <QFile>
#include <QJsonDocument>
#include <QJsonObject>
#include <QRegularExpression>
#include <QStringList>
#include <QTextStream>

// Assumed definitions; the snippet below uses these without declaring them.
static const QString LEFT_BRACE( "{" );
static const QString RIGHT_BRACE( "}" );

QFile file( argv[1] );
file.open( QIODevice::ReadOnly );
QTextStream textStream( &file );

QString jsonKey;
QString jsonString;
QRegularExpression jsonRegex( "\"-?\\d+\":" );

bool atEnd = false;


while( atEnd == false )
{
    QString regexMatch  = jsonRegex.match
    (
        jsonString.append( textStream.read(1000000) )
    ).captured();

    bool isRegexMatched = regexMatch.isEmpty() == false;

    if( isRegexMatched == false )
    {
        atEnd = textStream.atEnd();
    }

    if( atEnd || (jsonKey.isEmpty() == false && isRegexMatched) )
    {
        QString jsonObjectString;

        if( atEnd == false )
        {
            QStringList regexMatchSplit = jsonString.split( regexMatch );

            jsonObjectString = regexMatchSplit[0]
                .prepend( jsonKey )
                .prepend( LEFT_BRACE )
            ;

            jsonObjectString = jsonObjectString
                .left( jsonObjectString.size() - 1 )
                .append( RIGHT_BRACE )
            ;

            jsonKey    = regexMatch;
            jsonString = regexMatchSplit[1];
        }
        else
        {
            jsonObjectString = jsonString
                .prepend( jsonKey )
                .prepend( LEFT_BRACE )
            ;
        }

        QJsonObject jsonObject = QJsonDocument::fromJson
        (
            jsonObjectString.toUtf8()
        ).object();

        QString key = jsonObject.keys()[0];

        // ... process data and store in boost::interprocess::map ...
    }
    else if( isRegexMatched )
    {
        jsonKey    = regexMatch;
        jsonString = jsonString.split( regexMatch )[1];
    }
}

Upvotes: 0

Yasser Asmi

Reputation: 1170

Check out https://github.com/YasserAsmi/jvar. I have tested it with a large database (SF street data or something, around 2 GB). It was quite fast.

Upvotes: 4
