Reputation: 768
I have a dataframe as below:
+----------+----------+--------+
|     FNAME|     LNAME|     AGE|
+----------+----------+--------+
|      EARL|     JONES|      35|
|      MARK|      WOOD|      20|
+----------+----------+--------+
I am trying to add a new column as value to this dataframe which should be like this:
+----------+----------+--------+-------------------------------------------+
|     FNAME|     LNAME|     AGE|VALUE                                      |
+----------+----------+--------+-------------------------------------------+
|      EARL|     JONES|      35|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|      MARK|      WOOD|      20|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+----------+----------+--------+-------------------------------------------+
I am not able to achieve this using withColumn or any JSON function.
Any head start would be appreciated.
Spark: 2.3
Python: 3.7.x
Upvotes: 2
Views: 1882
Reputation: 768
Achieved using:
from pyspark.sql.functions import struct, to_json

df.withColumn("VALUE", to_json(struct([df[x] for x in df.columns])))
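For context, a minimal end-to-end sketch of that approach (the SparkSession setup and the DataFrame construction here are assumptions for illustration, not part of the original post):
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, to_json

# Assumed setup, for illustration only
spark = SparkSession.builder.appName("row-to-json").getOrCreate()
df = spark.createDataFrame(
    [("EARL", "JONES", "35"), ("MARK", "WOOD", "20")],
    ["FNAME", "LNAME", "AGE"],
)

# Pack all columns of each row into a struct, then serialize that struct to a JSON string
df.withColumn("VALUE", to_json(struct([df[x] for x in df.columns]))).show(truncate=False)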
Upvotes: 1
Reputation: 2451
Scala solution:
val df2 = df.select(
  to_json(
    map_from_arrays(lit(df.columns), array('*))
  ).as("value")
)
Python solution (I don't know how to do it for n columns as in Scala, because map_from_arrays does not exist in PySpark; see the sketch after the output below for a possible workaround):
import pyspark.sql.functions as f
df.select(
    f.to_json(
        f.create_map(f.lit("FNAME"), df.FNAME, f.lit("LNAME"), df.LNAME, f.lit("AGE"), df.AGE)
    ).alias("value")
).show(truncate=False)
output:
+-------------------------------------------+
|value |
+-------------------------------------------+
|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+-------------------------------------------+
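A possible n-column workaround in PySpark, as referenced above (a sketch, not from the original answer; it only assumes create_map and to_json from pyspark.sql.functions plus itertools.chain, and that df is the DataFrame from the question):
from itertools import chain
import pyspark.sql.functions as f

# Build a (lit(name), col(name)) pair for every column, flatten the pairs,
# and pass them to create_map so the approach works for any number of columns
pairs = chain.from_iterable((f.lit(c), f.col(c)) for c in df.columns)
df.select(f.to_json(f.create_map(*pairs)).alias("value")).show(truncate=False)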
Upvotes: 1
Reputation: 1590
Please consider using the SQL function to_json,
which you can find in org.apache.spark.sql.functions.
Here's the solution:
df.withColumn("VALUE", to_json(struct($"FNAME", $"LNAME", $"AGE")))
You can also avoid specifying the column names as follows:
df.withColumn("VALUE", to_json(struct(df.columns.map(col): _*)))
PS: the code I provided is written in Scala, but the logic is the same in Python; you just have to use the Spark SQL functions, which are available in both languages (a PySpark sketch follows below).
I hope it helps.
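A PySpark equivalent of the two Scala snippets above (a sketch; the imports are mine, and df is assumed to be the question's DataFrame):
from pyspark.sql.functions import col, struct, to_json

# Explicit column names, mirroring the first Scala snippet
df.withColumn("VALUE", to_json(struct(col("FNAME"), col("LNAME"), col("AGE"))))

# All columns at once, mirroring the column-name-free Scala snippet
df.withColumn("VALUE", to_json(struct(*[col(c) for c in df.columns])))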
Upvotes: 1