sjishan

Reputation: 3672

Spark ML: Taking square root of feature columns

Hi, I am using a custom UDF to take the square root of each value in each column.

import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

square_root_UDF = udf(lambda x: math.sqrt(x), DoubleType())

for x in features:
    dataTraining = dataTraining.withColumn(x, square_root_UDF(x))

Is there any faster way to get this done? The polynomial expansion function is not suitable in this case.

Upvotes: 4

Views: 10008

Answers (3)

Shubham Chaudhary

Reputation: 51073

To add the sqrt result as a column in Scala, you need to do the following:

import hc.implicits._
import org.apache.spark.sql.functions.sqrt

// the implicits import enables the 'x_variance symbol-to-column syntax
val dataTrainingWithStd = dataTraining.withColumn("x_std", sqrt('x_variance))
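Since the question is in PySpark, a rough equivalent sketch (assuming the same hypothetical x_variance column) would be:

from pyspark.sql.functions import sqrt

# adds an x_std column; x_variance is an assumed column name for illustration
dataTraining = dataTraining.withColumn("x_std", sqrt("x_variance"))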

Upvotes: 3

Danylo Zherebetskyy

Reputation: 1517

In order to speed up your calculation in this case:

  1. put your data into a DataFrame (not RDD)
  2. use vectorized operations (not lambda-operations with UDF) as suggested by @user7757642

Here is an example, assuming your dataTraining is an RDD:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sqrt

spark = SparkSession.builder.appName("SessionName") \
      .config("spark.some.config.option", "some_value") \
      .getOrCreate()

# convert the RDD to a DataFrame (Spark infers the schema from the data)
df = spark.createDataFrame(dataTraining)

# replace each feature column with its square root using the built-in sqrt
for x in features:
    df = df.withColumn(x, sqrt(x))

Upvotes: 0

user7757642

Reputation: 91

Don't use a UDF. Instead, use the built-in function:

from pyspark.sql.functions import sqrt

for x in features:
    dataTraining = dataTraining.withColumn(x, sqrt(x))
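If features covers many columns, a single select can replace the withColumn loop: each withColumn call adds another projection to the query plan, while one select performs all the replacements at once. A minimal sketch, assuming features is a list of column names and dataTraining is a DataFrame:

from pyspark.sql.functions import col, sqrt

# take sqrt of the feature columns, pass all other columns through unchanged
dataTraining = dataTraining.select(
    [sqrt(col(c)).alias(c) if c in features else col(c)
     for c in dataTraining.columns]
)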

Upvotes: 4
