Reputation: 33
I'm trying to split very large JSON files into smaller files for a given array. For example:
{
"headerName1": "headerVal1",
"headerName2": "headerVal2",
"headerName3": [{
"element1Name1": "element1Value1"
},
{
"element2Name1": "element2Value1"
},
{
"element3Name1": "element3Value1"
},
{
"element4Name1": "element4Value1"
},
{
"element5Name1": "element5Value1"
},
{
"element6Name1": "element6Value1"
}]
}
...down to { "elementNName1": "elementNValue1" } where N is a large number
The user provides the name which represents the array to be split (in this example "headerName3") and the number of array objects per file, e.g. 1,000,000
This would result in N files each containing the top name:value pairs (headerName1, headerName3) and up to 1,000,000 of the headerName3 objects in each file.
I'm using the excellent Newtonsof JSON.net and understand that I need to do this using a stream.
So far I have looked a reading in JToken objects to establish where the PropertyName == "headerName3" occurs when reading in the tokens but what I would like to do is then read in the entire JSON object for each object in the array and not have to continue parsing JSON into JTokens;
Here's a snippet of the code I am building so far:
using (StreamReader oSR = File.OpenText(strInput))
{
using (var reader = new JsonTextReader(oSR))
{
while (reader.Read())
{
if (reader.TokenType == JsonToken.StartObject)
{
intObjectCount++;
}
else if (reader.TokenType == JsonToken.EndObject)
{
intObjectCount--;
if (intObjectCount == 1)
{
intArrayRecordCount++;
// Here I want to read the entire object for this record into an untyped JSON object
if( intArrayRecordCount % 1000000 == 0)
{
//write these to the split file
}
}
}
}
}
}
I don't know - and in fact, and am not concerned with - the structure of the JSON itself, and the objects can be of varying structures within the array. I am therefore not serializing to classes.
Is this the right approach? Is there a set of methods in the JSON.net library I can easily use to perform such operation?
Any help appreciated.
Upvotes: 2
Views: 8333
Reputation: 116575
You can use JsonWriter.WriteToken(JsonReader reader, true)
to stream individual array entries and their descendants from a JsonReader
to a JsonWriter
. You can also use JProperty.Load(JsonReader reader)
and JProperty.WriteTo(JsonWriter writer)
to read and write entire properties and their descendants.
Using these methods, you can create a state machine that parses the JSON file, iterates through the root object, loads "prefix" and "postfix" properties, splits the array property, and writes the prefix, array slice, and postfix properties out to new file(s).
Here's a prototype implementation that takes a TextReader
and a callback function to create sequential output TextWriter
objects for the split file:
enum SplitState
{
InPrefix,
InSplitProperty,
InSplitArray,
InPostfix,
}
public static void SplitJson(TextReader textReader, string tokenName, long maxItems, Func<int, TextWriter> createStream, Formatting formatting)
{
List<JProperty> prefixProperties = new List<JProperty>();
List<JProperty> postFixProperties = new List<JProperty>();
List<JsonWriter> writers = new List<JsonWriter>();
SplitState state = SplitState.InPrefix;
long count = 0;
try
{
using (var reader = new JsonTextReader(textReader))
{
bool doRead = true;
while (doRead ? reader.Read() : true)
{
doRead = true;
if (reader.TokenType == JsonToken.Comment || reader.TokenType == JsonToken.None)
continue;
if (reader.Depth == 0)
{
if (reader.TokenType != JsonToken.StartObject && reader.TokenType != JsonToken.EndObject)
throw new JsonException("JSON root container is not an Object");
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
{
if ((string)reader.Value == tokenName)
{
state = SplitState.InSplitProperty;
}
else
{
if (state == SplitState.InSplitProperty)
state = SplitState.InPostfix;
var property = JProperty.Load(reader);
doRead = false; // JProperty.Load() will have already advanced the reader.
if (state == SplitState.InPrefix)
{
prefixProperties.Add(property);
}
else
{
postFixProperties.Add(property);
}
}
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.StartArray && state == SplitState.InSplitProperty)
{
state = SplitState.InSplitArray;
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.EndArray && state == SplitState.InSplitArray)
{
state = SplitState.InSplitProperty;
}
else if (state == SplitState.InSplitArray && reader.Depth == 2)
{
if (count % maxItems == 0)
{
var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
writers.Add(writer);
writer.WriteStartObject();
foreach (var property in prefixProperties)
property.WriteTo(writer);
writer.WritePropertyName(tokenName);
writer.WriteStartArray();
}
count++;
writers.Last().WriteToken(reader, true);
}
else
{
throw new JsonException("Internal error");
}
}
}
foreach (var writer in writers)
using (writer)
{
writer.WriteEndArray();
foreach (var property in postFixProperties)
property.WriteTo(writer);
writer.WriteEndObject();
}
}
finally
{
// Make sure files are closed in the event of an exception.
foreach (var writer in writers)
using (writer)
{
}
}
}
This method leaves all the files open until the end in case "postfix" properties, appearing after the array property, need to be appended. Be aware that there is a limit of 16384 open files at one time, so if you need to create more split files, this won't work. If postfix properties are never encountered in practice, you can just close each file before opening the next and throw an exception in case any postfix properties are found. Otherwise you may need to parse the large file in two passes or close and reopen the split files to append them.
Here is an example of how to use the method with an in-memory JSON string:
private static void TestSplitJson(string json, string tokenName)
{
var builders = new List<StringBuilder>();
using (var reader = new StringReader(json))
{
SplitJson(reader, tokenName, 2, i => { builders.Add(new StringBuilder()); return new StringWriter(builders.Last()); }, Formatting.Indented);
}
foreach (var s in builders.Select(b => b.ToString()))
{
Console.WriteLine(s);
}
}
Prototype fiddle.
Upvotes: 5