Reputation: 30045
I have a Spark DataFrame loaded from a multiline JSON file.
The schema of one of its columns (data) is as follows:
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f: struct (nullable = true)
| | | |-- 0: struct (nullable = true)
| | | | |-- v: double (nullable = true)
| | |-- ts: string (nullable = true)
And sample data:
array
0: {"f": {"0": {"v": 25.08}}, "ts": "2021-01-11T05:59:00.170Z"}
1: {"f": {"0": {"v": 25.92}}, "ts": "2021-03-22T03:29:00.170Z"}
2: {"f": {"0": {"v": 25.94}}, "ts": "2021-03-22T03:39:00.173Z"}
3: {"f": {"0": {"v": 25.95}}, "ts": "2021-03-22T03:49:00.170Z"}
4: {"f": {"0": {"v": 25.99}}, "ts": "2021-03-22T04:00:00.173Z"}
I want to extract just ts and v.
Example result:
Upvotes: 0
Views: 859
Reputation: 42422
You can explode the array of structs into multiple rows, then select the required struct fields:
import pyspark.sql.functions as F

# One row per array element, then pull out the nested fields ts and f.0.v
df2 = df.select(F.explode('data').alias('data')).select('data.ts', 'data.f.0.v')
df2.show(truncate=False)
+------------------------+-----+
|ts |v |
+------------------------+-----+
|2021-01-11T05:59:00.170Z|25.08|
|2021-03-22T03:29:00.170Z|25.92|
|2021-03-22T03:39:00.173Z|25.94|
|2021-03-22T03:49:00.170Z|25.95|
|2021-03-22T04:00:00.173Z|25.99|
+------------------------+-----+
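Outside Spark, the same extraction can be sanity-checked in plain Python on the sample records from the question; this sketch mirrors what explode followed by nested field selection produces (one (ts, v) pair per array element):

```python
import json

# The "data" array of structs, copied from the question's sample data
records_json = """
[
 {"f": {"0": {"v": 25.08}}, "ts": "2021-01-11T05:59:00.170Z"},
 {"f": {"0": {"v": 25.92}}, "ts": "2021-03-22T03:29:00.170Z"},
 {"f": {"0": {"v": 25.94}}, "ts": "2021-03-22T03:39:00.173Z"},
 {"f": {"0": {"v": 25.95}}, "ts": "2021-03-22T03:49:00.170Z"},
 {"f": {"0": {"v": 25.99}}, "ts": "2021-03-22T04:00:00.173Z"}
]
"""

records = json.loads(records_json)

# Equivalent of explode('data') + select('data.ts', 'data.f.0.v'):
# flatten the array, then walk the nested keys f -> "0" -> v
rows = [(r["ts"], r["f"]["0"]["v"]) for r in records]

for ts, v in rows:
    print(ts, v)
```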
Upvotes: 2