s.k
s.k

Reputation: 303

Spark: Dataframe Transformation

I have dataframe_1:

+-------------+----+---------+
|         Name| Age|   Salary|
+-------------+----+---------+
|Avery Bradley|25.0|7730337.0|
|  Jae Crowder|25.0|6796117.0|
+-------------+----+---------+

and want to transform it to dataframe_2:

+----------------------------------------------------------------------------------------------------------------------+
|         json_data                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------+
|[{"Name": "Avery Bradley", "Age": 25.0, "Salary" 7730337.0}, {"Name": "Jae Crowder", "Age": 25.0, "Salary" 6796117.0}]|    
+----------------------------------------------------------------------------------------------------------------------+

I can do dataframe_1.toPandas().to_dict(orient="records"), but this would be dataframe to dict(json object) transformation and I need dataframe to dataframe transformation.

A solution in PySpark, if possible, would be appreciated.

Upvotes: 0

Views: 159

Answers (2)

mck
mck

Reputation: 42352

You can do a collect_list of json:

import pyspark.sql.functions as F

df2 = df.agg(F.collect_list(F.to_json(F.struct('*'))).alias('json_data'))

df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------+
|json_data                                                                                                     |
+--------------------------------------------------------------------------------------------------------------+
|[{"Name":"Avery Bradley","Age":25.0,"Salary":7730337.0}, {"Name":"Jae Crowder","Age":25.0,"Salary":6796117.0}]|
+--------------------------------------------------------------------------------------------------------------+

Upvotes: 1

pltc
pltc

Reputation: 6082

You can merge columns to a map then create a JSON out of it

(df
    .withColumn('json', F.to_json(F.create_map(
        F.lit('name'), F.col('name'),
        F.lit('age'), F.col('age'),
        F.lit('salary'), F.col('salary'),
    )))
    .agg(F.collect_list('json').alias('json_value'))
)

+----------------------------------------------------------------------------------------------------------------------+
|json_value                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------+
|[{"name":"Avery Bradley","age":"25.0","salary":"7730337.0"}, {"name":"Jae Crowder","age":"25.0","salary":"6796117.0"}]|
+----------------------------------------------------------------------------------------------------------------------+

Upvotes: 1

Related Questions