Umesh Kacha

Reputation: 13656

Spark DataFrame change datatype based on column condition

I have a Spark DataFrame df1 with around 1000 columns, all of String type. I want to convert each of df1's columns from string to another type such as double or int, based on a condition on the column name. For example, assume df1 has only three columns, all of string type:

df1.printSchema

root
 |-- col1_term1: string (nullable = true)
 |-- col2_term2: string (nullable = true)
 |-- col3_term3: string (nullable = true)

The condition for changing a column's type is: if the column name contains term1, cast it to int; if it contains term2, cast it to double; and so on. I am new to Spark.

Upvotes: 4

Views: 7209

Answers (2)

y2k-shubham

Reputation: 11607

While it wouldn't produce results any different from the solution proposed by @Psidom, you can also use a bit of Scala's syntactic sugar, like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Fold over the column names, casting each matching column along the way
val modifiedDf: DataFrame = originalDf.columns.foldLeft[DataFrame](originalDf) { (tmpDf: DataFrame, colName: String) =>
  if (colName.contains("term1")) tmpDf.withColumn(colName, tmpDf(colName).cast(IntegerType))
  else if (colName.contains("term2")) tmpDf.withColumn(colName, tmpDf(colName).cast(DoubleType))
  else tmpDf
}
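
For a quick sanity check, here is a minimal sketch assuming originalDf is the same sample DataFrame used in the other answer and is created before running the fold above (spark refers to the active SparkSession, as in spark-shell):

import spark.implicits._

val originalDf = Seq(("1", "2", "3"), ("2", "3", "4"))
  .toDF("col1_term1", "col2_term2", "col3_term3")

// After running the foldLeft above:
modifiedDf.printSchema
// root
//  |-- col1_term1: integer (nullable = true)
//  |-- col2_term2: double (nullable = true)
//  |-- col3_term3: string (nullable = true)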

Upvotes: 1

akuiper

Reputation: 214927

You can simply map over the columns and cast each column to the proper data type based on its name:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val df = Seq(("1", "2", "3"), ("2", "3", "4")).toDF("col1_term1", "col2_term2", "col3_term3")

// Build one Column expression per column, choosing the cast from the column name
val cols = df.columns.map { x =>
  if (x.contains("term1")) col(x).cast(IntegerType)
  else if (x.contains("term2")) col(x).cast(DoubleType)
  else col(x)
}

df.select(cols: _*).printSchema
root
 |-- col1_term1: integer (nullable = true)
 |-- col2_term2: double (nullable = true)
 |-- col3_term3: string (nullable = true)
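
If there are many term-to-type rules (the question mentions around 1000 columns), one way to avoid a long if/else chain is to keep the rules in a Map and fall back to the untouched column when nothing matches. A minimal sketch, assuming the same df as above; the typeByTerm map and the typedCol helper are just illustrative names:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical mapping from name fragment to target type; extend as needed
val typeByTerm: Map[String, DataType] = Map(
  "term1" -> IntegerType,
  "term2" -> DoubleType
)

// Cast according to the first matching rule, or keep the column as-is
def typedCol(name: String): Column =
  typeByTerm
    .collectFirst { case (term, dt) if name.contains(term) => col(name).cast(dt) }
    .getOrElse(col(name))

df.select(df.columns.map(typedCol): _*).printSchema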

Upvotes: 8
