Reputation: 2821
I am trying to Convert a nested JSON to a flattened DataFrame.
I have read in the JSON as follows:
df = spark.read.json("/mnt/ins/duedil/combined.json")
The resulting dataframe looks like the following:
I have made a start on flattening the dataframe as follows
display(df.select ("companyId","countryCode"))
The above will display the following
I would like to select 'fiveYearCAGR" under the following: "financials:element:amortisationOfIntangibles:fiveYearCAGR"
Can someone let me know how to add to the select statement to retrieve the fiveYearCAGR?
Upvotes: 0
Views: 1788
Reputation: 9343
Your financials
is an array so if you want to extract something within the financials
, you need some array transformations.
One example is to use transform
.
from pyspark.sql import functions as F
df.select(
"companyId",
"countryCode",
F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR']).alias('fiveYearCAGR')
)
This will return the fiveYearCAGR
in an array. If you need to flatten it further, you can use explode
/explode_outer
.
Upvotes: 1