Reputation: 190
I need to parse JSON data from a string inside a Azure Data Flow. So far, I was able to parse all my data using the "Parse" function of the Data Flows. But now I am faced with a list of objects, and I don't know how to parse the values of that "complex array".
My data is looking like this:
There are some metadata fields (here null) and a Base64 encoded Body field. Every JSON document is in a separate JSON file.
The example data:
{
"metadata": null,
"Body": "eyJpZCI6ICIxIiwgImNvdW50IjogMTIsICJwcm9qZWN0cyI6IFt7Imd1aWQiOiAiMTIzMTItMTI0MzE0LTEyNDEtMTIzNCIsICJzdGF0dXMiOiAic3RvcHBlZCJ9LHsiZ3VpZCI6ICJjc2lzdi1uZDkyM24tMTM0MS0yMzQxIiwgInN0YXR1cyI6ICJydW5uaW5nIn1dfQ"
}
For clarification, the encoded example data looks like this:
{
"metadata": null,
"Body": "{"id": "1", "count": 12, "projects": [{"guid": "12312-124314-1241-1234", "status": "stopped"},{"guid": "csisv-nd923n-1341-2341", "status": "running"}]}"
}
My goal is to have a parquet file containing the data from the Body. So there should be three columns: id, count, projects. Projects should contain a list of complex objects.
The final result should look like this:
This is the result, when I load a JSON file, where the Body data is not encoded, but plain JSON containing the list of objects. Note, that this is not feasible for the original problem, where the JSON data is Base64 encoded.
When I load the example data into a dataflow the projection looks like this (as expected):
First, I need to decode the Base64 Body and then I can parse the JSON string:
And basic parsing is done here:
How can I parse the field "projects"? Given that every object in the list of the array field has the same schema.
I already tried parsing the field "projects" as string and add another Parse step to parse this string as "Array of documents", but the results are only Null values..
Upvotes: 0
Views: 4064
Reputation: 190
The final Dataflow looks like this:
The parsing has to be splitted in several parts.
Remember: The data I want to parse looks like this:
{
"metadata": null,
"Body": "{"id": "1", "count": 12, "projects": [{"guid": "12312-124314-1241-1234", "status": "stopped"},{"guid": "csisv-nd923n-1341-2341", "status": "running"}]}"
}
So first I need to parse the "Body" column, which is BodyDecoded, since I first had to decode from Base64. The array of objects has to be parsed as array of strings. Hence, the "Output column type" of the Parse step looks like this:
(id as string,
count as integer,
projects as string[])
The values are written in the BodyContent
column. To make the coming steps easier first the hierarchy is flattened. It is possible to use a column pattern for that, but I will do it explicitly here:
Also, the projects column is now renamed to projectsStringArray
Now the projectsStringArray
can be exploded using the "Flatten" step.
The column id
is also taken here, to be able to recollect the array later.
Now every string can be parsed by a "Parse" step, as usual
(guid as string,
status as string)
The parsed objects can be aggregated in lists again, using the "collect" function.
To get the desired structure the collected column has to be joined to the original data. The id
column can be used to join the data back.
After a final select, the structure looks as required:
Remarks: Thanks to Erik from Microsoft for his help! I was too focused on solving it using only the parsing step, that I didn't think about other ways to tackle the problem..
Upvotes: 1