LockTar

Reputation: 5467

Transform complex json files using ADF

I want to transform multiple complex JSON files into one complex JSON file using Azure Data Factory dataflow.

The multiple complex input JSON files are in the following format:

{
  "creationDate": "2022-01-19T17:00:17Z",
  "count": 2,
  "data": [
    {
      "id": "",
      "name": "Project A",
      "url": "",
      "state": "Open",
      "revision": 1,
      "visibility": "private",
      "lastUpdateTime": "2019-09-23T08:44:45.103Z"
    },
    {
      "id": "",
      "name": "Project B",
      "url": "",
      "state": "Done",
      "revision": 1,
      "visibility": "private",
      "lastUpdateTime": "2019-12-31T09:38:49.16Z"
    }
  ]
}

We want to transform those files to one single json file in the format:

[
  {
    "date": "2022-01-14",
    "count": 2,
    "projects": [
      {
        "name": "Project A",
        "state": "",
        "lastUpdateTime": ""
      },
      {
        "name": "Project B",
        "state": "",
        "lastUpdateTime": ""
      }
    ]
  },
  {
    "date": "2022-01-17",
    "count": 3,
    "projects": [
      {
        "name": "Project A",
        "state": "",
        "lastUpdateTime": ""
      },
      {
        "name": "Project B",
        "state": "",
        "lastUpdateTime": ""
      },
      {
        "name": "Project C",
        "state": "",
        "lastUpdateTime": ""
      }
    ]
  }
]

We were using a Derived Column with the expression @(name=data.name, state=data.state).

Can someone help us with this? We tried a lot of things, like a derived column and flattening first, but we can't get the output we want...

Thanks!

Upvotes: 0

Views: 121

Answers (1)

LockTar

Reputation: 5467

In the end, the solution was pretty close to what we already had.

Our final dataflow is as follows:

  1. First a Flatten transformation with Unroll by set to data. We also mapped creationDate to date.
  2. A Derived Column called projects with the expression @(name=name,state=state,lastUpdateTime=lastUpdateTime,url=url).
  3. An Aggregate transformation with a Group by on date. Set the aggregate for count to first(count) and (this was the solution) set projects to collect(projects).
  4. A Select transformation that keeps the columns date, count and projects.
  5. A Sort transformation sorting on date ascending.
  6. A Sink with the file name option set to output to single file and the partition option set to single partition.
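The steps above can be sketched outside ADF as plain transformation logic. This is a minimal Python sketch, not the actual dataflow; the sample input documents are assumptions mirroring the question's format, and the date is derived by truncating creationDate, which is one possible mapping:

```python
import json

# Hypothetical sample inputs in the format from the question (assumption).
input_docs = [
    {
        "creationDate": "2022-01-14T17:00:17Z",
        "count": 2,
        "data": [
            {"name": "Project A", "state": "Open", "lastUpdateTime": "2019-09-23T08:44:45.103Z", "url": ""},
            {"name": "Project B", "state": "Done", "lastUpdateTime": "2019-12-31T09:38:49.16Z", "url": ""},
        ],
    },
    {
        "creationDate": "2022-01-13T09:00:00Z",
        "count": 1,
        "data": [
            {"name": "Project C", "state": "Open", "lastUpdateTime": "2020-03-01T10:00:00Z", "url": ""},
        ],
    },
]

# Step 1: flatten (unroll by `data`), mapping creationDate -> date.
# Step 2: derived column `projects` holding the per-row sub-object.
rows = []
for doc in input_docs:
    date = doc["creationDate"][:10]  # keep only the date part (assumption)
    for item in doc["data"]:
        rows.append({
            "date": date,
            "count": doc["count"],
            "projects": {
                "name": item["name"],
                "state": item["state"],
                "lastUpdateTime": item["lastUpdateTime"],
                "url": item["url"],
            },
        })

# Step 3: group by date; first(count) and collect(projects).
grouped = {}
for row in rows:
    g = grouped.setdefault(row["date"], {"date": row["date"], "count": row["count"], "projects": []})
    g["projects"].append(row["projects"])

# Steps 4-6: select date/count/projects, sort by date ascending,
# and emit one JSON array (the single-file sink).
result = sorted(grouped.values(), key=lambda g: g["date"])
print(json.dumps(result, indent=2))
```

In the real dataflow the collect(projects) aggregate in step 3 plays the role of the list-append above: it gathers all the per-row sub-objects for a date into one array.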

Note: Because we sink to one big JSON file (output to single file), the sort order wasn't written to the JSON correctly, even though everything looked right when we debugged the dataflow (data preview). Strange behavior. When we also set the Sort transformation's partition option to single partition, the JSON file had the right sort order.


Upvotes: 1
