Patterson

Reputation: 2821

Databricks Flatten Nested JSON to Dataframe with PySpark

I am trying to convert a nested JSON file into a flattened DataFrame.

I have read in the JSON as follows:

df = spark.read.json("/mnt/ins/duedil/combined.json")

The resulting dataframe looks like the following:

(screenshot of the resulting DataFrame, not reproduced here)

I have made a start on flattening the DataFrame as follows:

display(df.select("companyId", "countryCode"))

The above displays the following:

(screenshot of the companyId and countryCode columns, not reproduced here)

I would like to select 'fiveYearCAGR' under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR"

Can someone let me know how to add to the select statement to retrieve the fiveYearCAGR?

Upvotes: 0

Views: 1788

Answers (1)

Emma

Reputation: 9343

Your financials column is an array, so to extract something from within it you need an array transformation.

One example is to use transform.

from pyspark.sql import functions as F

# transform() maps the lambda over each element of the financials array
# (the Python lambda syntax for transform requires Spark 3.1+)
df.select(
    "companyId",
    "countryCode",
    F.transform(
        "financials",
        lambda x: x["amortisationOfIntangibles"]["fiveYearCAGR"],
    ).alias("fiveYearCAGR"),
)

This will return the fiveYearCAGR in an array. If you need to flatten it further, you can use explode/explode_outer.

Upvotes: 1
