Dcook

Reputation: 961

How to access struct elements inside a pyspark dataframe?

I have the following schema for a pyspark dataframe:

root
 |-- maindata: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- label: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- unit: string (nullable = true)
 |    |    |    |-- dateTime: string (nullable = true)

Here is some data for a particular row, which I obtained with df.select(F.col("maindata")).show(1, False):

|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790], [t2, 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050], [TripCount, 136, , 2019-04-06T13:02:08.790]]]|

I want to access the TripCount values inside this, e.g. [TripCount -> 136, 135, etc.]. What is the best way to access this data? TripCount is present multiple times; also, is there any way to access only the label data, for example via maindata.label?

Upvotes: 2

Views: 4166

Answers (1)

Alex Ott

Reputation: 87174

I would suggest applying explode multiple times to convert the array elements into individual rows, and then either converting the struct into individual columns or working with the nested elements using dot syntax. For example:

from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
# explode twice: array<array<struct>> -> array<struct> -> struct
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('data'))
>>> df2.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
 |    |-- _3: string (nullable = true)
>>> df2.filter(col("data._1") == "k1").show()
+------------+
|        data|
+------------+
|[k1, v1, v2]|
+------------+
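
The same dot syntax works in select as well, so you can pull out individual struct fields by name (the label/value aliases here are just illustrative):

# select individual fields of the struct by name
df2.select(col("data._1").alias("label"), col("data._2").alias("value")).show()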

or you can extract members of the struct as individual columns:

from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
# explode twice, then promote the struct's fields to top-level columns
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('d')).select("d.*")
>>> df2.printSchema()
root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: string (nullable = true)

>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
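
Applied to your schema, a minimal sketch could look like this (assuming df is the DataFrame with the maindata column from the question, and Spark 2.4+ for flatten):

from pyspark.sql.functions import col, explode, flatten

# explode twice: array<array<struct>> -> array<struct> -> struct
exploded = (df
    .select(explode(col('maindata')).alias('inner'))
    .select(explode(col('inner')).alias('d')))

# all TripCount rows, e.g. the values 135 and 136
exploded.filter(col('d.label') == 'TripCount').select('d.value').show()

# or, without exploding: flatten the nested arrays and pull out only the labels
df.select(flatten(col('maindata')).getField('label').alias('labels')).show(truncate=False)

The last line also answers the maindata.label part of the question: extracting a field from an array of structs returns an array of that field's values.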

Upvotes: 2
