mightyMouse

Reputation: 728

Concatenate two nested columns in PySpark

I have a Spark DataFrame containing two nested struct columns, "firstname" and "secondname".

For example, one entry of the data is:

{"firstname" : {"s":"john"}, 
"secondname":{"s":"cena"} } 

I want to add a column by concatenating the two names, so that the entry becomes:

{"firstname" : {"s":"john"}, 
"secondname":{"s":"cena"}, 
"fullname" :
{"s" : "john cena"} 
} 

I have used a UDF, but it is inefficient for large data and acts as a black box for optimizations. Is there any way to achieve this result using built-in PySpark functions or SQL queries?
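For context, a minimal sketch of such a UDF-based approach (assuming the column names and the "s" struct field from the example above; all names and paths here are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Illustrative input path
df = spark.read.json("sampleJsonData.json")

# The UDF must return the nested {"s": ...} struct, so declare a matching schema
fullname_schema = StructType([StructField("s", StringType())])

@udf(returnType=fullname_schema)
def make_fullname(first, second):
    # Return a one-field tuple matching the {"s": ...} struct
    return (first + " " + second,)

df_udf = df.withColumn("fullname", make_fullname(col("firstname.s"), col("secondname.s")))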

Upvotes: 0

Views: 457

Answers (1)

QuickSilver

Reputation: 4045

See the inline code comments for the explanation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SampleJsonData {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()

    // Load your JSON
    val df = spark.read.json("src/main/resources/sampleJsonData.json")

    // Add a new column named "fullname"
    df.withColumn("fullname",
      // Select nested "firstname.s" and "secondname.s" and assign the result to "fullname.s"
      struct(concat(col("firstname.s"), lit(" "), col("secondname.s")).as("s")))
      // Write your JSON output
      .write.json("src/main/resources/sampleJsonDataOutput.json")
  }
}
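The code above is Scala; a rough PySpark equivalent using the same built-in functions (paths are illustrative) would look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, struct

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Load your JSON
df = spark.read.json("sampleJsonData.json")

# Build the nested {"s": ...} struct from the concatenated names
df.withColumn(
    "fullname",
    struct(concat(col("firstname.s"), lit(" "), col("secondname.s")).alias("s"))
).write.json("sampleJsonDataOutput.json")

Because concat, lit and struct are built-in column expressions, they run inside the JVM and remain visible to the optimizer, unlike a Python UDF.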

Upvotes: 1
