Patterson

Reputation: 2821

Databricks Flatten Nested JSON to Dataframe with PySpark

I am trying to convert a nested JSON file into a flattened DataFrame.

I have read in the JSON as follows:

df = spark.read.json("/mnt/ins/duedil/combined.json")

The resulting dataframe looks like the following:

(screenshot of the resulting DataFrame, not reproduced here)

I have made a start on flattening the DataFrame as follows:

display(df.select("companyId", "countryCode"))

The above displays the following:

(screenshot of the companyId and countryCode columns, not reproduced here)

I would like to select 'fiveYearCAGR' under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR"

Can someone let me know how to add to the select statement to retrieve the fiveYearCAGR?

Upvotes: 0

Views: 1788

Answers (1)

Emma

Reputation: 9343

Your financials column is an array, so to extract something from within it you need an array transformation.

One example is to use transform.

from pyspark.sql import functions as F

# transform() maps the lambda over each element of the financials array
# (the Python lambda syntax for transform requires Spark 3.1+)
df.select(
    "companyId",
    "countryCode",
    F.transform(
        "financials",
        lambda x: x["amortisationOfIntangibles"]["fiveYearCAGR"],
    ).alias("fiveYearCAGR"),
)

This will return the fiveYearCAGR in an array. If you need to flatten it further, you can use explode/explode_outer.

Upvotes: 1
