user21046852
user21046852

Reputation: 13

ADF Data Flow Remove empty JSON objects and arrays when building JSON files

I am using a Data Flow in Azure Data Factory to transform rows of CSV files into JSON documents of a standard layout. I can't figure out how to get rid of empty JSON objects and arrays when there is no data to populate.

For example if I have a CSV like below:

firstName,lastName,Email,Address1,City,State,Zip
Bob,Smith,[email protected],123 st,Somecity,TX,12345

I need to turn it into this:

{
    "firstName": "Bob",
    "lastName": "Smith",
    "contactData": [
        {
            "contactType": "postalAddress",
            "contactData": {
                "postalAddress1": "123 st",
                "postalCity": "Somecity",
                "postalState": "TX",
                "postalCode": "12345"
            }
        },
        {
            "contactType": "email",
            "contactData": {
                "emailAddress": "[email protected]"
            }
        }
    ]
}

I am using derived columns to build the subcolumns and arrays. I have been able to produce the JSON above. The problem I run into is that if an email or address is null, I want to remove the object from the array. If both are null, I want to remove the entire contactData object.

Example:

firstName,lastName,Email,Address1,City,State,Zip
Bob,Smith,,,,,

I need to turn it into this:

{
    "firstName": "Bob",
    "lastName": "Smith"
}

If I set all of the child objects to NULL with IF statements I can produce something like this:

{
    "firstName": "Bob",
    "lastName": "Smith",
    "contactData": [
        {
            "contactData": {}
        },
        {
            "contactData": {}
        }
    ]
}

but I can't get rid of the entire section. The Sink will get rid of the empty string objects, but not the nested JSON objects and arrays. Is there any way to do this in ADF Data Flows?

Upvotes: 0

Views: 1316

Answers (1)

Saideep Arikontham
Saideep Arikontham

Reputation: 6104

  • You can split the data and then apply union to get the desired result.
  • Since I don't have the previous data transformations, I have taken the following data as my source data.
[{
    "firstName": "Bob",
    "lastName": "Smith",
    "contactData": [
        {
            "contactType": "postalAddress",
            "contactData": {
                "postalAddress1": "123 st",
                "postalCity": "Somecity",
                "postalState": "TX",
                "postalCode": "12345"
            }
        },
        {
            "contactType": "email",
            "contactData": {
                "emailAddress": "[email protected]"
            }
        }
    ]
},
{
    "firstName": "b1",
    "lastName": "s1",
    "contactData": [
        {
            "contactData": {}
        },
        {
            "contactData": {}
        }
    ]
},
{
    "firstName": "Bob1",
    "lastName": "Smith1",
    "contactData": [
        {
            "contactType": "postalAddress",
            "contactData": {
                "postalAddress1": "123 st1",
                "postalCity": "Somecity1",
                "postalState": "TX1",
                "postalCode": "123456"
            }
        },
        {
            "contactType": "email",
            "contactData": {
                "emailAddress": "[email protected]"
            }
        }
    ]
},
{
    "firstName": "b2",
    "lastName": "s2",
    "contactData": [
        {
            "contactData": {}
        },
        {
            "contactData": {}
        }
    ]
}]
  • Now, I have taken a sample derived column to find the length of contactType. The rows without any contactType would have the same length (convert to string and find length).
tp : length(toString(contactData.contactType))

enter image description here

  • So, split the data based on whether the contactType (converted to string) length is equal to 2 or not. The split condition would be as shown below:
length(toString(contactData.contactType))!=2

enter image description here

  • This will split the data as required. The noContact stream would have the following data:

enter image description here

  • Now select only required columns. I have used rule based selection to select only columns where name is not contactData.
condition : name!='contactData'
column name : $$

enter image description here

  • Now apply union transformation by Name on hasContact and select1 stream.

enter image description here

  • I have configured the sink dataset as shown in the image below:

enter image description here

  • In sink, output to a single JSON file with selected name (under settings tab). The data preview in sink would be as:

enter image description here

  • Once the file is written, the null fields would not be written, so the data would be as per requirement. The following is an image for reference.

enter image description here

Upvotes: 2

Related Questions