earl

Reputation: 768

Create a column with values created from all other columns as JSON in PySpark

I have a dataframe as below:

+----------+----------+--------+
|     FNAME|     LNAME|     AGE| 
+----------+----------+--------+
|      EARL|     JONES|      35|
|      MARK|      WOOD|      20|
+----------+----------+--------+

I am trying to add a new column VALUE to this dataframe, which should look like this:

+----------+----------+--------+-------------------------------------------+
|     FNAME|     LNAME|     AGE|                                      VALUE|
+----------+----------+--------+-------------------------------------------+
|      EARL|     JONES|      35|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|      MARK|      WOOD|      20|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+----------+----------+--------+-------------------------------------------+

I am not able to achieve this using withColumn or any JSON function.

Any head start would be appreciated.

Spark: 2.3
Python: 3.7.x
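
For reference, the sample dataframe can be created as follows (all columns as strings, matching the quoted values in the JSON):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # sample data matching the dataframe shown above (all values as strings)
    df = spark.createDataFrame(
        [("EARL", "JONES", "35"), ("MARK", "WOOD", "20")],
        ["FNAME", "LNAME", "AGE"],
    )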

Upvotes: 2

Views: 1882

Answers (3)

earl

Reputation: 768

Achieved using:

from pyspark.sql.functions import struct, to_json

df.withColumn("VALUE", to_json(struct([df[x] for x in df.columns])))

Upvotes: 1

chlebek

Reputation: 2451

Scala solution:

    import org.apache.spark.sql.functions.{array, lit, map_from_arrays, to_json}

    val df2 = df.select(
      to_json(
        map_from_arrays(lit(df.columns), array('*))
      ).as("value")
    )

Python solution (I don't know how to do it for n columns like in Scala, because map_from_arrays does not exist in PySpark):

    import pyspark.sql.functions as f

    df.select(
        f.to_json(
            f.create_map(f.lit("FNAME"), df.FNAME, f.lit("LNAME"), df.LNAME, f.lit("AGE"), df.AGE)
        ).alias("value")
    ).show(truncate=False)

output:

    +-------------------------------------------+
    |value                                      |
    +-------------------------------------------+
    |{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
    |{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
    +-------------------------------------------+
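
Note: since Spark 2.4, map_from_arrays is available in PySpark as well, so the n-columns version is possible there too (a sketch, assuming all columns share one type, strings here):

    import pyspark.sql.functions as f

    df.select(
        f.to_json(
            f.map_from_arrays(
                f.array(*[f.lit(c) for c in df.columns]),  # keys: the column names
                f.array(*df.columns),                      # values: the column contents
            )
        ).alias("value")
    ).show(truncate=False)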


Upvotes: 1

mahmoud mehdi

Reputation: 1590

Please consider using the SQL function to_json, which you can find in org.apache.spark.sql.functions.

Here's the solution:

df.withColumn("VALUE", to_json(struct($"FNAME", $"LNAME", $"AGE")))

You can also avoid specifying the column names, as follows:

df.withColumn("VALUE", to_json(struct(df.columns.map(col): _*)))

PS: the code I provided is written in Scala, but the logic is the same in Python; you just have to use a Spark SQL function that is available in both languages.
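
For instance, the column-agnostic version translates to PySpark like this (a sketch mirroring the Scala line above):

    from pyspark.sql.functions import col, struct, to_json

    df.withColumn("VALUE", to_json(struct(*[col(c) for c in df.columns])))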

I hope it helps.

Upvotes: 1
