Matt Burland

Reputation: 45155

Validating JSON in a stream

I have (potentially large) json files being uploaded that need to be written out somewhere else. I would like to do at least some basic validation (for example, make sure they are valid JSON - maybe even apply a schema) but I'd like to avoid having to load the entire (again, potentially large) file into memory and then have to write it out again. I'm using JSON.Net and thought I could do something like this:

using (var sr = new StreamReader(source))
using (var jsonReader = new JsonTextReader(sr))
using (var textWriter = new StreamWriter(myoutputStream))
using (var outputStream = new JsonTextWriter(textWriter))
{
    while (jsonReader.Read())
    {
        // TODO: any addition validation!
        outputStream.WriteToken(jsonReader);
    }
}

With the idea being that the reader would walk the JSON file as it comes in and write it out as it processes each token. If there is a mistake in the input, it'll throw an exception which I can handle by returning an error message to the user.

The problem is that when I step through this code with a JSON file consisting of a single object with an array property holding a collection of more objects (the whole file is about 1.3k lines formatted), I expected it to step through token by token. Instead, it seems to read in the entire object and spit it back out again in one step.

Is there a way to handle large JSON objects from a stream, make sure they really are valid JSON, and stream them back out without having to hold the entire object in memory at once?

Although the answer might be more general, the data I'm currently attempting to handle is GeoJson data. A (very short) example looks like this:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}

A much longer example might be:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "Van Dorn Street",
        "marker-color": "#0000ff",
        "marker-symbol": "rail-metro",
        "line": "blue"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          -77.12911152370515,
          38.79930767201779
        ]
      }
    },...//lots more objects
  ]
}

The suggestion from here: https://www.newtonsoft.com/json/help/html/ReadingWritingJSON.htm

is that it should read individual tokens: StartObject, PropertyName, etc.
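To illustrate, a minimal sketch (using the short Feature example above; the class name here is just for the demo) that walks the input and prints each token the reader reports:

```csharp
using System;
using System.IO;
using Newtonsoft.Json;

class TokenDump
{
    static void Main()
    {
        string json = @"{""type"":""Feature"",""geometry"":{""type"":""Point"",""coordinates"":[125.6,10.1]}}";

        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            // One line per Read() call: StartObject, PropertyName, String,
            // PropertyName, StartObject, ..., EndObject
            while (reader.Read())
            {
                Console.WriteLine($"{reader.TokenType}: {reader.Value}");
            }
        }
    }
}
```

An invalid document makes `Read()` throw a `JsonReaderException` at the offending token, which is what makes this loop usable for streaming validation.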

Upvotes: 2

Views: 2645

Answers (4)

gorillapower

Reputation: 480

I had a similar requirement where I needed to parse a JSON file and extract the 'schema' or shape of the data. While this is trivial with small documents (using in-memory serialization), handling large JSON files proved to be difficult, as you can easily run into out-of-memory exceptions.

After much head scratching, and considering the implementation in this answer, the code below will stream over JSON data and build up a sample JSON file; when it encounters arrays, it only considers the first item. This way I can efficiently generate a JSON 'schema' from existing JSON. I struggled to find other solutions that could generate a schema from existing JSON without loading the data into memory first.

    //https://stackoverflow.com/questions/43747477/how-to-parse-huge-json-file-as-stream-in-json-net
    //https://stackoverflow.com/questions/49241890/validating-json-in-a-stream
    //Requires: System.IO, System.Linq, System.Text, System.Text.RegularExpressions,
    //System.Threading.Tasks and Newtonsoft.Json
    /// <summary>
    /// Generates sample json from raw json.
    /// When the generator encounters arrays, it only considers the first item.
    /// </summary>
    /// <param name="inputStream"></param>
    /// <returns></returns>
    /// <exception cref="Newtonsoft.Json.JsonReaderException">Thrown if the input stream is invalid json.</exception>
    private static async Task<Stream> GenerateJSONSchemaAsync(Stream inputStream)
    {
        MemoryStream myOutputStream = new MemoryStream();

        using (StreamReader sr = new StreamReader(inputStream))
        using (JsonReader reader = new JsonTextReader(sr))
        // leaveOpen: true so disposing the writer doesn't also close the MemoryStream we return
        using (StreamWriter textWriter = new StreamWriter(myOutputStream, Encoding.UTF8, 1024, leaveOpen: true))
        using (JsonTextWriter outputStream = new JsonTextWriter(textWriter))
        {
            while (await reader.ReadAsync())
            {
                //If the path contains [x], skip the token when any index x is > 0 (ie anything past the first child)
                MatchCollection match = Regex.Matches(reader.Path, @"(?<=\[).+?(?=\])");

                if (!match.Any() || !match.Any(x => int.Parse(x.Value) > 0))
                {
                    await outputStream.WriteTokenAsync(reader, false);
                }
            }
        }

        myOutputStream.Position = 0; //rewind so the caller can read the result
        return myOutputStream;
    }

Ideally, instead of just ignoring sibling array items, it would be better to do something like a merge operation for each item, or have AddPropertyIfNotExists functionality, so that the schema is continually updated to account for discrepancies between item properties (it's possible that some array item objects have additional properties).
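That merge idea isn't streaming, but for a bounded sample of array items it could be sketched in memory with Json.NET's `JContainer.Merge` (the sample size of 10 is arbitrary):

```csharp
using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class MergeSample
{
    static void Main()
    {
        var array = JArray.Parse(@"[{""a"":1},{""a"":2,""b"":""x""}]");

        // Union the shape of the first few items instead of only keeping item [0]
        var merged = new JObject();
        foreach (var item in array.Take(10).OfType<JObject>())
        {
            merged.Merge(item, new JsonMergeSettings
            {
                MergeArrayHandling = MergeArrayHandling.Union
            });
        }

        // merged now has both "a" and "b", even though item [0] lacks "b"
        Console.WriteLine(merged.ToString(Newtonsoft.Json.Formatting.None));
    }
}
```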

Upvotes: 0

Broderick

Reputation: 21

If you can pull the stream twice (to avoid pulling it directly into memory), or save the stream as a file so you can create multiple streams from it, use the JSchemaValidatingReader with an empty while loop on the read. The JSchemaValidatingReader will go through the entire JSON without loading it into memory.

using (var stream = fileStream.Stream)
using (var streamReader = new StreamReader(stream))
using (var jsonReader = new JsonTextReader(streamReader))
using (var validatingReader = new JSchemaValidatingReader(jsonReader) { Schema = schema })
{
  validatingReader.ValidationEventHandler += (o, a) =>
  {
      // log or output the validation errors that come up here
  };

  while (validatingReader.Read())
  {
      // Do nothing here - forces reader through the stream and validates
  }
}

The schema passed to the JSchemaValidatingReader is the schema you're validating against. You will have to do any of your custom validation inside classes that extend JsonValidator, which you can see how to do here. After it validates, you would pull the stream again, either from the remote source or from the file.
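A sketch of such a custom validator (the "longitude" format and its range check are made up for illustration; registering validators through `JSchemaReaderSettings.Validators` when parsing the schema is the mechanism Json.NET Schema documents):

```csharp
using System.Collections.Generic;
using Newtonsoft.Json.Linq;
using Newtonsoft.Json.Schema;

// Applies to any schema that declares "format": "longitude"
public class LongitudeValidator : JsonValidator
{
    public override bool CanValidate(JSchema schema) => schema.Format == "longitude";

    public override void Validate(JToken value, JsonValidatorContext context)
    {
        double lon = (double)value;
        if (lon < -180 || lon > 180)
            context.RaiseError($"Longitude {lon} is out of range.");
    }
}

public static class SchemaLoader
{
    // Custom validators are attached when the schema text is parsed
    public static JSchema Load(string schemaJson) =>
        JSchema.Parse(schemaJson, new JSchemaReaderSettings
        {
            Validators = new List<JsonValidator> { new LongitudeValidator() }
        });
}
```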

Upvotes: 1

Matt Burland

Reputation: 45155

To at least partially answer my own question, the problem is here:

outputStream.WriteToken(jsonReader);

Which, as it turns out, writes the token and all of its children, which I assume means it basically reads the whole file. The first token is a StartObject, and writing all of its children out means reading all the way to the matching EndObject token.

Using:

outputStream.WriteToken(jsonReader, false);

Will not automatically read all the children and will instead step through token by token, which I'm guessing (hoping) will be more memory efficient with very large files.
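Putting that together, the loop from the question becomes the following (made self-contained here with in-memory streams so it runs as-is):

```csharp
using System;
using System.IO;
using System.Text;
using Newtonsoft.Json;

class StreamingCopy
{
    static void Main()
    {
        var source = new MemoryStream(Encoding.UTF8.GetBytes(
            @"{""type"":""Feature"",""properties"":{""name"":""Dinagat Islands""}}"));
        var myoutputStream = new MemoryStream();

        using (var sr = new StreamReader(source))
        using (var jsonReader = new JsonTextReader(sr))
        using (var textWriter = new StreamWriter(myoutputStream))
        using (var jsonWriter = new JsonTextWriter(textWriter))
        {
            while (jsonReader.Read())
            {
                // TODO: any additional validation
                // false = don't write children; each Read() advances one token
                jsonWriter.WriteToken(jsonReader, false);
            }
        } // disposing the writers flushes the copied JSON into myoutputStream
    }
}
```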

Still not 100% sure if that's the most efficient solution and it would be nice to do at least a little validation beyond just making sure it's valid JSON.

Upvotes: 1

Hussein Salman

Reputation: 8256

If you expect your JSON files to have some kind of standardized structure, you can create a class with the same attributes and then deserialize the JSON file into that class.

If the JSON format is valid, deserialization will succeed. As an example:

[JsonObject]
public class MyClass
{
    [JsonProperty("id")]
    public string Id {get; set;}
    [JsonProperty("name")]
    public string Name { get; set; }


    public MyClass() { }
}

Then deserialize the JSON via this call:

var myDeserializedJSON = JsonConvert.DeserializeObject<MyClass>(jsonData);
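To surface a bad payload as an error message instead of an unhandled exception, the call can be wrapped like this (MyClass as above; the sample input is deliberately broken):

```csharp
using System;
using Newtonsoft.Json;

public class MyClass
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("name")]
    public string Name { get; set; }
}

class Program
{
    static void Main()
    {
        string jsonData = @"{""id"":}";  // invalid: missing value

        try
        {
            var myDeserializedJSON = JsonConvert.DeserializeObject<MyClass>(jsonData);
        }
        catch (JsonException ex)
        {
            // JsonReaderException (bad syntax) and JsonSerializationException
            // (shape mismatch) both derive from JsonException
            Console.WriteLine("Invalid JSON: " + ex.Message);
        }
    }
}
```

Note that, unlike the streaming approaches above, this loads and materializes the whole document, so it suits smaller payloads.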

Upvotes: -1
