Neha

Reputation: 547

Create another DataFrame from an existing DataFrame with a different schema in Spark

I have a dataframe which looks like this:

root
 |-- A1: string (nullable = true)
 |-- A2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- A3 : string (nullable = true)
 |-- A4 : array (nullable = true)
 |    |-- element: string (containsNull = true)

I also have a target schema, which looks like this:

StructType(StructField(A1,ArrayType(StringType,true),true), StructField(A2,StringType,true), StructField(A3,IntegerType,true), StructField(A4,ArrayType(StringType,true),true))

I want to convert this dataframe to the schema defined above. Can someone help me with how to do this?

Note: the schema and the dataframe are both loaded at runtime and are not fixed.

Upvotes: 2

Views: 2989

Answers (1)

Mehrez

Reputation: 695

You can use udf from org.apache.spark.sql.functions (it returns an org.apache.spark.sql.expressions.UserDefinedFunction) to transform a string to an array and an array to a string, like this:

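 import org.apache.spark.sql.functions.{col, udf}

 // Wrap a string in a one-element array
 val string_to_array_udf = udf((s: String) => Array(s))
 // Keep only the first element of the array (fails on an empty or null array)
 val array_to_string_udf = udf((a: Seq[String]) => a.head)
 // Parse a string as an integer (throws NumberFormatException on bad input)
 val string_to_int_udf = udf((s: String) => s.toInt)

 val newDf = df.withColumn("a12", string_to_array_udf(col("a1"))).drop("a1").withColumnRenamed("a12", "a1")
 .withColumn("a32", string_to_int_udf(col("a3"))).drop("a3").withColumnRenamed("a32", "a3")
 .withColumn("a22", array_to_string_udf(col("a2"))).drop("a2").withColumnRenamed("a22", "a2")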

 newDf.printSchema
 root
   |-- a4: array (nullable = true)
   |    |-- element: string (containsNull = true)
   |-- a1: array (nullable = true)
   |    |-- element: string (containsNull = true)
   |-- a3: integer (nullable = true)
   |-- a2: string (nullable = true)
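
Since the question says the schema is only known at runtime, the per-column conversions could probably be derived from the target StructType instead of being hard-coded. Below is a rough sketch of that idea using built-in column functions rather than UDFs. The helper names convert and conformTo are made up for illustration. It assumes every target field exists in the dataframe under exactly the same name (no dots in the names), that an array-to-scalar conversion should keep only the first element (as a.head does above), and that a scalar-to-array conversion should wrap the value in a one-element array.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Hypothetical helper: build the expression that turns a column of type
// `from` into type `to`. Only flat and array types are handled here.
def convert(c: Column, from: DataType, to: DataType): Column = (from, to) match {
  // Same type on both sides: nothing to do
  case (f, t) if f == t => c
  // Array to array: cast the element type
  case (_: ArrayType, ArrayType(elem, _)) => c.cast(ArrayType(elem))
  // Array to scalar: take the first element, then cast it
  // (assumption: the remaining elements can be discarded; with ANSI mode
  // enabled, indexing into an empty array may raise an error)
  case (_: ArrayType, t) => c.getItem(0).cast(t)
  // Scalar to array: cast the value, then wrap it in a one-element array
  case (_, ArrayType(elem, _)) => array(c.cast(elem))
  // Scalar to scalar: plain cast (yields null or an error on bad input,
  // depending on the ANSI setting)
  case (_, t) => c.cast(t)
}

// Hypothetical helper: project `df` onto the fields of `target`,
// converting each column according to the rules above.
def conformTo(df: DataFrame, target: StructType): DataFrame = {
  val projected = target.fields.map { f =>
    convert(col(f.name), df.schema(f.name).dataType, f.dataType).as(f.name)
  }
  df.select(projected: _*)
}
```

Usage would then be something like val newDf = conformTo(df, targetSchema). The resulting columns follow the order of the target schema, and columns that are not in the target schema are dropped. Nullability flags are not enforced by this sketch, so they may differ from the target.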

Upvotes: 3
