Reputation: 161
I'm trying to calculate the average for each column in a dataframe and subtract from each element in the column. I've created a function that attempts to do that, but when I try to implement it using a UDF, I get an error: 'float' object has no attribute 'map'. Any ideas on how I can create such a function? Thanks!
def normalize(data):
average=data.map(lambda x: x[0]).sum()/data.count()
out=data.map(lambda x: (x-average))
return out
mapSTD=udf(normalize,IntegerType())
dats = data.withColumn('Normalized', mapSTD('Fare'))
Upvotes: 2
Views: 6831
Reputation: 51
Adding onto Piotr's answer. If you need to keep the existing dataframe and add normalized columns with aliases, the function can be modified as:
def normalize(df, columns):
aggExpr = []
for column in columns:
aggExpr.append(mean(df[column]).alias(column))
averages = df.agg(*aggExpr).collect()[0]
selectExpr = ['*']
for column in columns:
selectExpr.append((df[column] - averages[column]).alias('normalized_'+column))
return df.select(selectExpr)
Upvotes: 3
Reputation: 689
In your example there is problem with UDF function which can not be applied to row and whole DataFrame. UDF can be applied only to single row, but Spark also enables implementing UDAF (User Defined Aggregate Functions) working on whole DataFrame.
To solve your problem you can use below function:
from pyspark.sql.functions import mean
def normalize(df, column):
average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
return df.select(df[column] - average)
Use it like this:
normalize(df, "Fare")
Please note that above only works on single column, but it is possible to implement something more generic:
def normalize(df, columns):
selectExpr = []
for column in columns:
average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
selectExpr.append(df[column] - average)
return df.select(selectExpr)
use it like:
normalize(df, ["col1", "col2"])
This works, but you need to run aggregation for each column, so with many columns performance could be issue, but it is possible to generate only one aggregate expression:
def normalize(df, columns):
aggExpr = []
for column in columns:
aggExpr.append(mean(df[column]).alias(column))
averages = df.agg(*aggExpr).collect()[0]
selectExpr = []
for column in columns:
selectExpr.append(df[column] - averages[column])
return df.select(selectExpr)
Upvotes: 5