Xavier_prash

Reputation: 27

PySpark create a json string by combining columns

I have a dataframe.

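from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Reuse the active session, or start one when running outside a shell/notebook
spark = SparkSession.builder.getOrCreate()

input_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("code", StringType(), True),
])
input_data = [
    ("1", "2021-12-01", "a"),
    ("2", "2021-12-01", "b"),
]
input_df = spark.createDataFrame(data=input_data, schema=input_schema)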


I would like to perform a transformation that combines a set of columns into a single JSON string column. The columns to be combined are known ahead of time. The output should keep the remaining columns and add one column holding the JSON string.

Is there a suggested method to achieve this? I would appreciate any help.

Upvotes: 1

Views: 87

Answers (1)

anky

Reputation: 75080

You can pack the columns into a struct and then convert that struct to a JSON string with to_json:

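from pyspark.sql import functions as F
col_to_combine = ['Date', 'code']  # columns to fold into the JSON string
# Pack the columns into a struct, serialize it with to_json, then drop the originals
output = (input_df.withColumn('combined', F.to_json(F.struct(*col_to_combine)))
                  .drop(*col_to_combine))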

output.show(truncate=False)
+---+--------------------------------+
|ID |combined                        |
+---+--------------------------------+
|1  |{"Date":"2021-12-01","code":"a"}|
|2  |{"Date":"2021-12-01","code":"b"}|
+---+--------------------------------+
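
To check the expected strings without starting Spark, the per-row effect of the expression above can be modeled in plain Python. This is only a rough sketch, not Spark's implementation: `combine_to_json` and the `out_col` parameter are hypothetical names, and skipping `None` values assumes to_json's default behavior of omitting null fields.

```python
import json

def combine_to_json(row, cols_to_combine, out_col="combined"):
    """Plain-Python model of to_json(struct(*cols)) followed by drop(*cols)."""
    # Keep the listed columns in the given order; skip None values, which
    # mirrors to_json omitting null fields by default (an assumption here).
    packed = {c: row[c] for c in cols_to_combine if row[c] is not None}
    # Every column that was not packed stays as an ordinary column.
    kept = {k: v for k, v in row.items() if k not in cols_to_combine}
    # Compact separators give the same no-whitespace form as the output above.
    kept[out_col] = json.dumps(packed, separators=(",", ":"))
    return kept

rows = [
    {"ID": "1", "Date": "2021-12-01", "code": "a"},
    {"ID": "2", "Date": "2021-12-01", "code": "b"},
]
result = [combine_to_json(r, ["Date", "code"]) for r in rows]
for r in result:
    print(r)
```

Each resulting dict holds the ID plus a combined string in the same form as the table above, so the list can serve as expected values in a unit test of the Spark job.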

Upvotes: 1
