Reputation: 71
I am splitting 'split_column' to another five columns as per the following code. However I wanted to have this new columns to be renamed so that they would have some meaningful names(let's say new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5" in this example)
val df1 = df.withColumn("new_column", split(col("split_column"), "\\|")).select(col("*") +: (0 until 5).map(i => col("new_column").getItem(i).as(s"newcol$i")): _*).drop("split_column","new_column")
val new_columns_renamed = Seq("....., "new_renamed1", "new_renamed2", "new_renamed3", "new_renamed4", "new_renamed5")
val df2 = df1.toDF(new_columns_renamed: _*)
However issue with this approach is some of my splits might have more than fifty new rows. In thi renaming approach, a little typo (like extra comma, missing double quotes) would be painful to detect.
Is there a way to rename columns with case class like below ?
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
Please note that in the actual scenario names would not look like new_renamed1, new_renamed2, ......, new_renamed5 , they would be totally different.
Upvotes: 0
Views: 1784
Reputation: 35229
You could try something like this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.Encoders
val names = Encoders.product[SplittedRecord].schema.fieldNames
names.zipWithIndex
.foldLeft(df.withColumn("new_column", split(col("split_column"), "\\|")))
{ case (df, (c, i)) => df.withColumn(c, $"new_column"(i)) }
Upvotes: 4
Reputation: 41957
One of the ways to use the case class
case class SplittedRecord (new_renamed1: String, new_renamed2: String, new_renamed3: String, new_renamed4: String, new_renamed5: String)
is through udf
function as
import org.apache.spark.sql.functions._
def splitUdf = udf((array: Seq[String])=> SplittedRecord(array(0), array(1), array(2), array(3), array(4)))
df.withColumn("test", splitUdf(split(col("split_column"), "\\|"))).drop("split_column")
.select(col("*"), col("test.*")).drop("test")
Upvotes: 2