Reputation: 1966
I am working with a very large .json file that is a long series of curly brace pairs, each representing a line from a book. Here are two such pairs. They just continue over and over:
{"index":{"_index":"tolstoy","_id":27}}
{"type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"}
{"index":{"_index":"tolstoy","_id":28}}
{"type":"line","line_id":101,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"I will call Pyotr."}
From what I can tell, the first _id is the id of the item. It goes from 0 to 200,000 and corresponds to the total lines in the JSON file. In turn, line_id is the line in the particular work, and speech_number is the speech by a character.
This is not valid JSON (as per JSONLint). I have the actual file, so I can modify it in any way that I want.
I would imagine it ought to be an array of line objects, so I should place square brackets around the entire thing, and I think I should place commas between the entries. However, when I do that it is still invalid. I can make it valid by putting a comma between the index and type lines, but that does not make sense intuitively.
How can I modify the above so that it is nice, well-formed JSON that I can use in an API?
Thanks for any suggestions.
Upvotes: 1
Views: 1988
Reputation: 16574
Data in your current format can be referred to as interleaved: there are multiple distinct types of records, and their only relation to each other is their position in the file. This structure provides an interesting challenge to represent in JSON, which is an object notation, also referred to as a graph of hierarchical data.
When converting this data to JSON, we should consider what value the structure will add for the final consumer. There is always some value in making the data easier to parse and consume: even if you are not selling the data feed, you can reduce costs if you have to answer fewer questions about the data later.
The two record types in the set are:
an index record: {"index":{"_index":"tolstoy","_id":27}}
a text entry record: {"type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"}
Because there is a 1:1 relationship between these records, there is no value in nesting one of them within the other in the JSON output. There are also no shared/common fields between the two records, nor is there a linking field between them. This means each record depends on the other purely by its position in the file.
From the consumer's point of view, the index is meaningless on its own; indexes are usually designed to facilitate searching for or referencing a record quickly.
As the task here is to process the data line by line, and we've identified that there is little value in producing a nested structure and that there are no common fields, this interleaved dataset can easily be merged, or flattened, into a simple array of single records.
After analysing the data we might be able to identify ways to normalise it into object graphs, but that would require re-processing the entire file, and the overall benefit to the consuming user would be questionable. Data is generally easier to consume by a wider array of tools if it is flat rather than nested.
To flatten this data, we take the fields from the index record and inject them into the text entry record. In this example I have also flattened out the index structure, which was itself a single nested object:
[
{"index":"tolstoy","_id":27, "type":"line","line_id":100,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"Weep. It will do you good"},
{"index":"tolstoy","_id":28, "type":"line","line_id":101,"book":"War and Peace","speech_number":1,"speaker":"Sonya","text_entry":"I will call Pyotr."}
]
The steps to achieve this:
1. Prefix the file with a new line that starts the array: [
2. Remove {"_index": (this is safe to do globally with a simple find and replace)
3. Join the index and text entry records by replacing the following, including the line break, with a comma:
}}
{
If you are using Visual Studio, this can be achieved with a simple find and replace using a regular expression; the search pattern would be something like \}\}\n\{ with a comma as the replacement.
4. Append the whole file with the array close character: ]
5. Append a comma to every line that ends with }
Steps 4 and 5 can be reversed; from an automation point of view I found that to be an easier way to ensure the last line did not end with a comma. If you would rather script the whole conversion in one pass, see the sketch below.
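For a large file, doing this by hand in an editor gets tedious, so the whole conversion can also be scripted. A minimal sketch in Python, assuming the source file is named tolstoy.json, the output merged.json, and that the records strictly alternate index/text entry on their own lines (all of these names are just placeholders):

import json

with open("tolstoy.json", encoding="utf-8") as f:
    lines = [line for line in (raw.strip() for raw in f) if line]

records = []
# Records are interleaved: each index record is followed by its text entry.
for index_line, entry_line in zip(lines[0::2], lines[1::2]):
    index = json.loads(index_line)["index"]  # {"_index": ..., "_id": ...}
    entry = json.loads(entry_line)
    # Flatten: hoist the nested index fields into the text entry record.
    records.append({"index": index["_index"], "_id": index["_id"], **entry})

with open("merged.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

This produces the same flattened array shown above, and it avoids the trailing comma problem entirely because the serialiser handles the separators.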
Having flattened the index, I would also suggest changing the name of _id to index_id; after reading this post back, I'm surprised I missed it the first time.
Overall the names used for fields in this file are intuitive, but if the file itself is quite large (and if that is an issue...), then you could save a few bytes by reducing the names of other fields, like text_entry to text, or speech_number to speechId/speechNo/speechNum.
When reducing names for data integrations you should try to avoid reducing columns to ambiguous terms like line_id to lid or speech_number to sid, as this imposes more assumptions on the consumer, which means more time you need to spend documenting or answering questions.
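If you do shorten names, a small rename map keeps the choice explicit in one place and doubles as documentation. A sketch continuing from the records list built above (the short names chosen here are only examples):

rename = {"text_entry": "text", "speech_number": "speechNum"}
records = [
    {rename.get(key, key): value for key, value in record.items()}
    for record in records
]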
When optimising data feeds, you need to find a balance between what is theoretically achievable and what is practical with the resources available to both you and the consumer.
If every record in the file has the same value for a field, then that field could be removed entirely; in the current example, type has the potential to fall into this category.
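A quick way to verify which fields are actually constant before dropping them, again continuing from the records list above:

from collections import defaultdict

# Collect the distinct values seen for each field across all records.
seen = defaultdict(set)
for record in records:
    for key, value in record.items():
        seen[key].add(value)

# A field with exactly one distinct value carries no per-record information.
constant = {key for key, values in seen.items() if len(values) == 1}
records = [{k: v for k, v in r.items() if k not in constant} for r in records]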
If bytes across the wire is a concern and you want to stick with JSON, then normalising the data into a nested object graph is an option. You should use some sort of mapping or translation service, or code, to do this (I wouldn't attempt it by hand), but it might look something like nested arrays in this order: the index, then the book, then each speech, then the individual lines.
If you were to process this into a normalised structure, it might look something like this:
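One hypothetical shape, grouping the fields that repeat across records (the books, speeches and lines array names are made up for illustration):

{
  "index": "tolstoy",
  "books": [
    {
      "book": "War and Peace",
      "speeches": [
        {
          "speech_number": 1,
          "speaker": "Sonya",
          "lines": [
            {"_id": 27, "line_id": 100, "text_entry": "Weep. It will do you good"},
            {"_id": 28, "line_id": 101, "text_entry": "I will call Pyotr."}
          ]
        }
      ]
    }
  ]
}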
For a large file (with whitespace removed) that could make a significant difference, whilst it is still easy to target the nodes in the graph via expression syntax. Importantly, no information has been lost.
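For example, a consumer could pull every line of dialogue with a JSONPath-style expression against the hypothetical structure above:

$.books[*].speeches[*].lines[*].text_entry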
Upvotes: 1