Reputation: 728
I have a Spark DataFrame containing two columns, "firstname" and "secondname", each of which is a struct with a single string field "s".
For example, one row of the data is:
{"firstname" : {"s":"john"},
"secondname":{"s":"cena"} }
I want to add a "fullname" column by concatenating the two names, so that the row becomes:
{"firstname" : {"s":"john"},
"secondname":{"s":"cena"},
"fullname":{"s":"john cena"} }
I have tried a UDF, but it is inefficient on large data and is a black box to Spark's optimizer. Is there a way to achieve this using built-in PySpark functions or a SQL query?
Upvotes: 0
Views: 457
Reputation: 4045
See the inline code comments for an explanation of the answer. The example is in Scala and uses only built-in column functions, no UDF:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleJsonData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()

    // Load your JSON
    val df = spark.read.json("src/main/resources/sampleJsonData.json")

    // Add a new column named "fullname"
    df.withColumn("fullname",
        // Select nested "firstname.s" and "secondname.s" and assign the result to "fullname.s"
        struct(concat(col("firstname.s"), lit(" "), col("secondname.s")).as("s")))
      // Write your JSON output
      .write.json("src/main/resources/sampleJsonDataOutput.json")
  }
}
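Since the question also asks about SQL queries, here is a rough sketch of the same idea written as a Spark SQL statement. The object name `FullNameSql`, the view name `people`, and the inline one-row dataset are made up for illustration; they are not from the question. Note one behavioural difference: `concat_ws` skips null inputs, whereas `concat` in the code above returns null if either name is null.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch
object FullNameSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    // Assumption: a single in-memory row shaped like the example in the question
    val json = Seq("""{"firstname":{"s":"john"},"secondname":{"s":"cena"}}""").toDS()
    val df = spark.read.json(json)

    // Register the DataFrame under a (hypothetical) view name so SQL can see it
    df.createOrReplaceTempView("people")

    // named_struct builds the nested {"s": ...} value; concat_ws joins with a space
    val result = spark.sql(
      """SELECT firstname,
        |       secondname,
        |       named_struct('s', concat_ws(' ', firstname.s, secondname.s)) AS fullname
        |FROM people""".stripMargin)

    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

The SQL string itself is not Scala-specific, so the same query should work when passed to `spark.sql(...)` from PySpark after registering the DataFrame as a temp view. Because it uses only built-in functions, the optimizer can see through it, unlike a UDF.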
Upvotes: 1