taygan
taygan

Reputation: 145

Nested JSON to CSV with U-SQL via Azure Data Lake

Question

Is there a more elegant way of parsing the nested JSON document example below?

Particularly the @middle step, based on my readings I would have thought JsonFunctions.JsonTuple(part_b, "rating") AS rating in Step 3 would have been sufficient but it does not seem to work, hence why I have added @middle.

sample.json

{
    "listings": [
        {
            "part_a": {
                "random": "x"
            },
            "part_b": {
                "listing_id": "001",
                "rating": {
                    "text": "four",
                    "numeric": "4.0"
                }
            }
        },
        {
            "part_a": {
                "random": "y"
            },
            "part_b": {
                "listing_id": "002",
                "rating": {
                    "text": "seven",
                    "numeric": "7.0"
                }
            }
        },
        {
            "part_a": {
                "random": "z"
            },
            "part_b": {
                "listing_id": "003",
                "rating": {
                    "text": "two",
                    "numeric": "2.0"
                }
            }
        }
    ]
}

sample.usql

CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"adl://ADL_NAME.azuredatalakestore.net/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"adl://ADL_NAME.azuredatalakestore.net/Microsoft.Analytics.Samples.Formats.dll";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;

// 1. Define Input and Output
DECLARE @InputFile string = @"adl://ADL_NAME.azuredatalakestore.net/sample.json";
DECLARE @OutputFile string = @"adl://ADL_NAME.azuredatalakestore.net/sample.csv";

// 2. Extract JSON (schema on read)
@json =
EXTRACT
    part_a string,
    part_b string
FROM
    @InputFile
USING new JsonExtractor("$.listings[*]");

// 3. Convert string into dictionary
@listing_dict =
SELECT
    JsonFunctions.JsonTuple(part_a) AS random,
    JsonFunctions.JsonTuple(part_b) AS listing
FROM @json;

// 4. Extract values
@middle =
SELECT
    random,
    listing,
    JsonFunctions.JsonTuple(listing["rating"]) AS rating
FROM @listing_dict;

@listing_values =
SELECT
    random["random"] AS random,
    listing["listing_id"] AS listing_id,
    rating["text"] AS rating_text,
    rating["numeric"] AS rating_numeric
FROM @middle;

// 5. Write output to CSV
OUTPUT @listing_values
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);

sample.csv

"random","listing_id","rating_text","rating_numeric"
"x","001","four","4.0"
"y","002","seven","7.0"
"z","003","two","2.0"

Upvotes: 0

Views: 726

Answers (1)

maya-msft
maya-msft

Reputation: 432

By using the MultiLevelJsonExtractor (part of same library), your code could be simplified to something like:

// 1. Define Input and Output
DECLARE @InputFile string = @".../sample.json";
DECLARE @OutputFile string = @".../sample.csv";

// 2. Extract JSON (schema on read)
@json =
EXTRACT
    random string,
    listing_id string,
    rating_text string,
    rating_numeric string
FROM
    @InputFile
USING new MultiLevelJsonExtractor("listings[*]",true, "part_a.random", "part_b.listing_id", "part_b.rating.text", "part_b.rating.numeric" );


// 3. Write output to CSV
OUTPUT @json
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);

See details about this extractor on GitHub: MultiLevelJsonExtractor.cs

Upvotes: 2

Related Questions