Reputation: 39
I'm on Databricks and I'm working on a classification problem. I have a DataFrame with 2000+ columns. I want to cast all the columns that will become features to double.
val array45 = data.columns.drop(1)
for (element <- array45) {
  data.withColumn(element, data(element).cast("double"))
}
data.printSchema()
The cast to double works, but the result is not saved back into the DataFrame called data. If I create a new DataFrame inside the loop, it won't exist outside the for loop. I do not want to use a UDF.
How can I solve this?
EDIT: Thanks to both of you for your answers! I don't know why, but the answers from Shaido and Raul take a long time to compute. I think it comes from Databricks.
Upvotes: 0
Views: 2377
Reputation: 41987
You can simply write a function that casts a column to DoubleType and use that function in the select method.
The function:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._

def func(column: Column) = column.cast(DoubleType)
And then use the function in select as:
val array45 = data.columns.drop(1)
import org.apache.spark.sql.functions._
data.select(array45.map(name => func(col(name))): _*).show(false)
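Note that selecting only array45 drops the first column from the result. If you also need to keep that first column untouched, a minimal sketch (assuming the first column is an id/label you don't want to cast) is to prepend it to the projection:
// keep the first column as-is, cast every other column to double
val firstCol = col(data.columns.head)
val casted = data.select((firstCol +: array45.map(name => func(col(name)))): _*)
casted.printSchema()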
I hope the answer is helpful
Upvotes: 3
Reputation: 395
Let me suggest using foldLeft:
val array45 = data.columns.drop(1)
val newData = array45.foldLeft(data) {
  (acc, c) => acc.withColumn(c, acc(c).cast("double"))
}
newData.printSchema()
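As a quick illustration, here is a minimal, self-contained sketch of the same foldLeft on a toy DataFrame (the column names and sample values are made up; on Databricks the spark session already exists, so the builder line is only needed for a local run):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// toy DataFrame: the first column stays a string, the rest become features
val df = Seq(("a", "1", "2"), ("b", "3", "4")).toDF("id", "f1", "f2")

val featureCols = df.columns.drop(1)
val casted = featureCols.foldLeft(df)((acc, c) => acc.withColumn(c, acc(c).cast("double")))

casted.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- f1: double (nullable = true)
//  |-- f2: double (nullable = true)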
Hope this helps!
Upvotes: 1
Reputation: 28422
You can assign the new DataFrame to a var at every iteration, thus keeping the most recent one at all times.
var finalData = data.cache()
for (element <- array45) {
  finalData = finalData.withColumn(element, finalData(element).cast("double"))
}
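This works because withColumn returns a new DataFrame instead of modifying the existing one, so reassigning the var keeps the latest result. As a quick check (assuming array45 is defined as in the question):
finalData.printSchema()
// every column named in array45 should now be reported as double (nullable = true)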
Upvotes: 1