Reputation: 730
My Spark program needs to read a file containing a matrix of integers. Columns are separated with ",", and the number of columns is not the same each time I run the program.
I read the file as a dataframe:
var df = spark.read.csv(originalPath);
but when I print schema it gives me all the columns as Strings.
I convert all columns to Integers as below, but when I print the schema of df again afterwards, the columns are still Strings.
df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));
I'd appreciate any help solving this casting issue.
Thanks.
Upvotes: 1
Views: 5771
Reputation: 553
Alternatively, since you mentioned the number of columns is not the same each time, you could take the highest possible column count and build a schema from it, with IntegerType as the type of every column. Pass this schema when loading the file so the DataFrame columns are read as integers directly; no explicit conversion is required in this case.
import org.apache.spark.sql.types._
val csvSchema = StructType(Array(
StructField("_c0", IntegerType, true),
StructField("_c1", IntegerType, true),
StructField("_c2", IntegerType, true),
StructField("_c3", IntegerType, true)))
val df = spark.read.schema(csvSchema).csv(originalPath)
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)
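Since the column count varies between runs, the schema itself can be generated instead of hand-written. A minimal sketch, assuming maxCols = 4 is the highest column count you expect (adjust it to your widest file); it builds the equivalent schema as a DDL string, which spark.read.schema also accepts (Spark 2.3+):

```scala
// Assumed upper bound on the number of columns across all input files.
val maxCols = 4

// Build "_c0 INT, _c1 INT, ..." matching Spark's default CSV column names.
val ddlSchema = (0 until maxCols).map(i => s"_c$i INT").mkString(", ")

// Pass the generated schema when reading, instead of hardcoding StructFields:
// val df = spark.read.schema(ddlSchema).csv(originalPath)
```

Extra trailing columns in the schema simply come back as null for rows that are narrower, so one generous schema covers every run.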
Upvotes: 1
Reputation: 35249
DataFrames are immutable. Your code creates a new DataFrame for each column and immediately discards it. It is best to use map and select:
val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)
but you could use foldLeft:
df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))
or even (please don't) a mutable reference:
var df = Seq(("1", "2", "3")).toDF
df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
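The same pitfall exists with any immutable collection: calling a transformation without keeping its result changes nothing. A pure-Scala analogy of what the question's foreach was doing:

```scala
val xs = List("1", "2", "3")

// Like df.withColumn, map returns a NEW collection; the original is untouched.
xs.map(_.toInt)            // result discarded, xs is still List("1", "2", "3")

// Keep the result instead:
val ints = xs.map(_.toInt) // List(1, 2, 3)
```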
Upvotes: 5