Reputation: 730
My Spark program needs to read a file containing a matrix of integers. Columns are separated with ",", and the number of columns is not the same each time I run the program.
I read the file as a dataframe:
var df = spark.read.csv(originalPath);
but when I print schema it gives me all the columns as Strings.
I convert all columns to Integers as below, but when I print the schema of df again afterwards, the columns are still Strings.
df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));
I'd appreciate any help solving this casting issue.
Thanks.
Upvotes: 1
Views: 5771
Reputation: 553
Alternatively, since you mentioned the number of columns is not the same each time, you could take the highest possible column count and build a schema from it, with IntegerType as the type of every column. Pass this schema when loading the file so the DataFrame columns are read as integers directly; no explicit conversion is required in this case.
import org.apache.spark.sql.types._
val csvSchema = StructType(Array(
StructField("_c0", IntegerType, true),
StructField("_c1", IntegerType, true),
StructField("_c2", IntegerType, true),
StructField("_c3", IntegerType, true)))
val df = spark.read.schema(csvSchema).csv(originalPath)
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)
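Since the column count varies between runs, the schema itself can be generated instead of hand-written. A minimal sketch, assuming maxCols = 4 is the highest column count you expect (adjust it to your widest file); it builds the equivalent schema as a DDL string, which spark.read.schema also accepts (Spark 2.3+):

```scala
// Assumed upper bound on the number of columns across all input files.
val maxCols = 4

// Build "_c0 INT, _c1 INT, ..." matching Spark's default CSV column names.
val ddlSchema = (0 until maxCols).map(i => s"_c$i INT").mkString(", ")

// Pass the generated schema when reading, instead of hardcoding StructFields:
// val df = spark.read.schema(ddlSchema).csv(originalPath)
```

Extra trailing columns in the schema simply come back as null for rows that are narrower, so one generous schema covers every run.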
Upvotes: 1
Reputation: 35249
DataFrames are immutable. Your code creates a new DataFrame for each column and immediately discards it. It is best to use map and select:
val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)
but you could use foldLeft:
df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))
or even (please don't) a mutable reference:
var df = Seq(("1", "2", "3")).toDF
df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
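The same pitfall exists with any immutable collection: calling a transformation without keeping its result changes nothing. A pure-Scala analogy of what the question's foreach was doing:

```scala
val xs = List("1", "2", "3")

// Like df.withColumn, map returns a NEW collection; the original is untouched.
xs.map(_.toInt)            // result discarded, xs is still List("1", "2", "3")

// Keep the result instead:
val ints = xs.map(_.toInt) // List(1, 2, 3)
```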
Upvotes: 5