Reputation: 63
Say I have data like the following:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType, LongType
spark = SparkSession.builder.appName("AveragingArray").getOrCreate()
# Define schema
schema = StructType([
StructField("id", LongType(), True),
StructField("arrays", ArrayType(DoubleType()), True)
])
# Sample data
data = [
(1, [18.2, 50.9]),
(2, [1.0, 3.4, 5.5]),
(3, [10.0, 20.1, 30.0])
]
# Create DataFrame
df = spark.createDataFrame(data, schema=schema)
I want a simple way to convert the arrays
item to be the average of the values in the array. For example, the first row would be 34.55.
I had tried something like this:
from pyspark.sql import functions as F
def average_arrays(df):
# Explode the array to individual elements
exploded_df = df.withColumn("arrays", F.explode("arrays"))
# Group by the original identifier and calculate the average then return as int
result_df = exploded_df.groupBy("id").agg(F.avg("arrays").alias("arrays"))
df = df.withColumn('arrays', col('arrays')[0].cast('int'))
return result_df
However, it still just returns the original array. I'd hugely appreciate any help here, thanks in advance.
Upvotes: 0
Views: 89
Reputation: 35424
The return value in average_arrays()
is incorect
The code should be
def average_arrays(df):
# Explode the array to individual elements
exploded_df = df.withColumn("arrays", F.explode("arrays"))
# Group by the original identifier and calculate the average then return as int
result_df = exploded_df.groupBy("id").agg(F.avg("arrays").alias("arrays"))
return result_df
Aditional tip: I will rewrite the job in the following way for simplicity.
df.createOrReplaceTempView("array_table")
# Execute the SQL query
result_df = spark.sql("""
SELECT
id,
values,
-- Calculate the average of the array elements
aggregate(values, 0.0D, (acc, x) -> acc + x) / size(values) AS average
FROM array_table
""")
result_df.show(truncate=False)
aggregate doc
aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
Upvotes: 0
Reputation: 1
I can't understand your mean. I try to run your code. I find that it returns the average value of the array. My results are as follows.
+---+------------------+
| id| arrays|
+---+------------------+
| 1| 34.55|
| 3|20.033333333333335|
| 2|3.3000000000000003|
+---+------------------+
Upvotes: 0