Reputation: 177
The data is:
data = [{"_id":"Inst001","Type":"AAAA", "Model001":[{"_id":"Mod001", "Name": "FFFF"},
{"_id":"Mod0011", "Name": "FFFF4"}]},
{"_id":"Inst002", "Type":"BBBB", "Model001":[{"_id":"Mod002", "Name": "DDD"}]}]
I need to build a dataframe from it as follows:
pid | _id | Name |
---|---|---|
Inst001 | Mod001 | FFFF |
Inst001 | Mod0011 | FFFF4 |
Inst002 | Mod002 | DDD |
The approach I had is
Is there a built-in method in PySpark for this problem?
Upvotes: 0
Views: 116
Reputation: 42392
Create a dataframe with a proper schema, and apply inline to the Model001 column:
df = spark.createDataFrame(
data,
'_id string, Type string, Model001 array<struct<_id:string, Name:String>>'
).selectExpr('_id as pid', 'inline(Model001)')
df.show(truncate=False)
+-------+-------+-----+
|pid |_id |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD |
+-------+-------+-----+
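To make it clearer what inline is doing here, the following is a minimal plain-Python sketch of the same row expansion: one output row per struct in the Model001 array, with the parent _id carried along as pid. It does not use Spark at all; the helper name flatten_models is hypothetical and exists only for illustration.

```python
# Same sample data as in the question.
data = [
    {"_id": "Inst001", "Type": "AAAA",
     "Model001": [{"_id": "Mod001", "Name": "FFFF"},
                  {"_id": "Mod0011", "Name": "FFFF4"}]},
    {"_id": "Inst002", "Type": "BBBB",
     "Model001": [{"_id": "Mod002", "Name": "DDD"}]},
]


def flatten_models(records):
    """Hypothetical helper: emit one (pid, _id, Name) tuple per Model001 element.

    This mirrors the shape of the result produced by inline(Model001);
    it is an illustration only, not how Spark implements it.
    """
    return [
        (rec["_id"], model["_id"], model["Name"])
        for rec in records          # outer loop: one parent record at a time
        for model in rec["Model001"]  # inner loop: one row per nested struct
    ]


rows = flatten_models(data)
for row in rows:
    print(row)
```

As in this sketch, a record whose Model001 array is empty contributes no rows with inline. If such records should be kept (with nulls for the nested columns), Spark SQL also provides inline_outer.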
Upvotes: 1