Reputation: 1966
I am working with a very large .json file that is a long series of curly brace pairs, each representing a line from a book. Here are two such pairs. They just continue over and over:
{"index":{"_index":"tolstoy","_id":27}}
{"type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"}
{"index":{"_index":"tolstoy","_id":28}}
{"type":"line","line_id":101,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"I will call Pyotr."}
From what I can tell, the first _id is the id of the item. It goes from 0 to 200,000 and corresponds to the total lines in the JSON file. In turn, line_id is the line in the particular work, and speech_number is the speech by a character.
This is not valid JSON (as per JSONLint). I have the actual file, so I can modify it in any way that I want.
I would imagine it ought to be an array of line objects, so I should place square brackets around the entire thing, and I think I should place commas between the entries. However, when I do that it is still invalid. I can make it valid by putting a comma between the index and type lines, but that does not make sense intuitively.
How can I modify the above so that it is nice, well-formed JSON that I can use in an API?
Thanks for any suggestions.
Upvotes: 1
Views: 1988
Reputation: 16574
Data in your current format can be referred to as interleaved: there are multiple distinct types of records, and their only relation to each other is their position in the file. This structure provides an interesting challenge to represent in JSON, which is an object notation, also referred to as a graph of hierarchical data.
When converting this data to JSON, we should consider what value the structure will add for the final consumer. There is always some value in making the data easier to parse and consume: even if you are not selling the data feed, you can reduce costs if you have to answer fewer questions about the data later.
The two record types in the set are:
an index record: {"index":{"_index":"tolstoy","_id":27}}
a text entry record: {"type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"}
Because there is a 1:1 relationship between these records, there is no value in nesting one of them within the other in the JSON output. There are also no shared/common fields between the two records, nor is there a linking field between them. This means each record depends on the other purely by its position in the file.
From the consumer's point of view, the index is meaningless on its own; indexes are usually designed to facilitate searching for or referencing a record quickly.
As the task here is to process the data line by line, and we've identified that there is little value in producing a nested structure and that there are no common fields, this interleaved dataset can easily be merged, or flattened, into a simple array of single records.
After analysing the data we might be able to identify ways to normalise it into object graphs, but that would require re-processing the entire file, and the overall benefit to the consuming user would be questionable. Data is generally easier to consume by a wider array of tools if it is flat rather than nested.
To flatten this data, we take the fields from the index record and inject them into the text entry record. In this example I have also flattened out the index structure, which was itself a single nested object:
[
{"index":"tolstoy","_id":27, "type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"},
{"index":"tolstoy","_id":28, "type":"line","line_id":101,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"I will call Pyotr."}
]
The steps to achieve this:
1. Prefix the file with a new line that starts the array: [
2. Remove {"_index": (this is safe to do globally with a simple find and replace)
3. Join the index and text entry records by replacing the following, including the line break, with a comma:
}}
{
If you are using Visual Studio, this can be achieved with a simple find and replace using a regular expression; the search pattern would be something like \}\}\n\{ with a comma as the replacement.
4. Append the whole file with the array close character: ]
5. Append a comma to every line that ends with }
Steps 4 and 5 can be reversed; from an automation point of view I found that to be an easier way to ensure the last line did not end with a comma. If you would rather script the whole conversion in one pass, see the sketch below.
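For a large file, doing this by hand in an editor gets tedious, so the whole conversion can also be scripted. A minimal sketch in Python, assuming the source file is named tolstoy.json, the output merged.json, and that the records strictly alternate index/text entry on their own lines (all of these names are just placeholders):

import json

with open("tolstoy.json", encoding="utf-8") as f:
    lines = [line for line in (raw.strip() for raw in f) if line]

records = []
# Records are interleaved: each index record is followed by its text entry.
for index_line, entry_line in zip(lines[0::2], lines[1::2]):
    index = json.loads(index_line)["index"]  # {"_index": ..., "_id": ...}
    entry = json.loads(entry_line)
    # Flatten: hoist the nested index fields into the text entry record.
    records.append({"index": index["_index"], "_id": index["_id"], **entry})

with open("merged.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

This produces the same flattened array shown above, and it avoids the trailing comma problem entirely because the serialiser handles the separators.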
Having flattened the index, I would also suggest changing the name of _id to index_id; after reading this post back, I'm surprised I missed it the first time.
Overall the names used for fields in this file are intuitive, but if the file itself is quite large (and if that is an issue...), then you could save a few bytes by reducing the names of other fields, like text_entry to text, or speech_number to speechId/speechNo/speechNum.
When reducing names for data integrations you should try to avoid reducing columns to ambiguous terms like line_id to lid or speech_number to sid, as this imposes more assumptions on the consumer, which means more time you need to spend documenting or answering questions.
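If you do shorten names, a small rename map keeps the choice explicit in one place and doubles as documentation. A sketch continuing from the records list built above (the short names chosen here are only examples):

rename = {"text_entry": "text", "speech_number": "speechNum"}
records = [
    {rename.get(key, key): value for key, value in record.items()}
    for record in records
]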
When optimising data feeds, you need to find a balance between what is theoretically achievable and what is practical with the resources available to both you and the consumer.
If every record in the file has the same value for a field, then that field could be removed entirely; in the current example, type has the potential to fall into this category.
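A quick way to verify which fields are actually constant before dropping them, again continuing from the records list above:

from collections import defaultdict

# Collect the distinct values seen for each field across all records.
seen = defaultdict(set)
for record in records:
    for key, value in record.items():
        seen[key].add(value)

# A field with exactly one distinct value carries no per-record information.
constant = {key for key, values in seen.items() if len(values) == 1}
records = [{k: v for k, v in r.items() if k not in constant} for r in records]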
If bytes across the wire is a concern and you want to stick with JSON, then normalising the data into a nested object graph is an option. You should use some sort of mapping or translation service, or code, to do this (I wouldn't attempt it by hand), but it might look something like nested arrays in this order: the index, then the book, then each speech, then the individual lines.
If you were to process this into a normalised structure, it might look something like this:
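One hypothetical shape, grouping the fields that repeat across records (the books, speeches and lines array names are made up for illustration):

{
  "index": "tolstoy",
  "books": [
    {
      "book": "War and Peace",
      "speeches": [
        {
          "speech_number": 1,
          "speaker": "Sonya",
          "lines": [
            {"_id": 27, "line_id": 100, "text_entry": "Weep. It will do you good"},
            {"_id": 28, "line_id": 101, "text_entry": "I will call Pyotr."}
          ]
        }
      ]
    }
  ]
}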
For a large file (with whitespace removed) that could make a significant difference, whilst it is still easy to target the nodes in the graph via expression syntax. Importantly, no information has been lost.
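For example, a consumer could pull every line of dialogue with a JSONPath-style expression against the hypothetical structure above:

$.books[*].speeches[*].lines[*].text_entry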
Upvotes: 1