Reputation: 139
I have a dataframe as shown below:
+-----------+------------+
|parsed_date|       count|
+-----------+------------+
| 2017-12-16|           2|
| 2017-12-16|           2|
| 2017-12-17|           2|
| 2017-12-17|           2|
| 2017-12-18|           1|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-20|           1|
+-----------+------------+
I want to compute median of the entire 'count' column and add the result to a new column.
I tried:
median = df.approxQuantile('count', [0.5], 0.1).alias('count_median')
But of course I am doing something wrong as it gives the following error:
AttributeError: 'list' object has no attribute 'alias'
Upvotes: 3
Views: 9471
Reputation: 24386
Spark 3.4+ has median for calculating the exact median (best suited to small/mid-sized dataframes, since an exact median is costlier to compute than an approximate one).
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('2017-12-16', 2),
('2017-12-16', 2),
('2017-12-17', 2),
('2017-12-17', 2),
('2017-12-18', 1),
('2017-12-19', 4),
('2017-12-19', 4),
('2017-12-19', 4),
('2017-12-19', 4),
('2017-12-20', 1)],
['parsed_date', 'count'])
df.withColumn('median_of_count', F.expr('median(count) over()')).show()
# +-----------+-----+---------------+
# |parsed_date|count|median_of_count|
# +-----------+-----+---------------+
# | 2017-12-16|    2|            2.0|
# | 2017-12-16|    2|            2.0|
# | 2017-12-17|    2|            2.0|
# | 2017-12-17|    2|            2.0|
# | 2017-12-18|    1|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-20|    1|            2.0|
# +-----------+-----+---------------+
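If you prefer the DataFrame API to a SQL expression, the same window computation can also be written with F.median over an unpartitioned window (a sketch, assuming Spark 3.4+):
from pyspark.sql import Window, functions as F

# An empty partitionBy() makes the window span the whole dataframe,
# so every row gets the global median.
df.withColumn('median_of_count', F.median('count').over(Window.partitionBy())).show()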
Upvotes: 0
Reputation: 42352
You need to add the column with withColumn, because approxQuantile returns a list of floats, not a Spark Column.
import pyspark.sql.functions as F
df2 = df.withColumn('count_median', F.lit(df.approxQuantile('count', [0.5], 0.1)[0]))
df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|         2.0|
| 2017-12-16|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-18|    1|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-20|    1|         2.0|
+-----------+-----+------------+
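For reference, this is what approxQuantile itself returns — a plain Python list with one float per requested quantile, which is why the original .alias call raised AttributeError and why [0] is needed above:
median = df.approxQuantile('count', [0.5], 0.1)
print(median)     # e.g. [2.0] — a list, not a Column
print(median[0])  # 2.0 — the float that F.lit turns into a constant column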
You can also use the approx_percentile / percentile_approx function in Spark SQL:
import pyspark.sql.functions as F
df2 = df.withColumn('count_median', F.expr("approx_percentile(count, 0.5, 10) over ()"))
df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|           2|
| 2017-12-16|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-18|    1|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-20|    1|           2|
+-----------+-----+------------+
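The same computation is also available through the DataFrame API as F.percentile_approx (Spark 3.1+); a minimal sketch of the equivalent window version:
from pyspark.sql import Window
import pyspark.sql.functions as F

# percentile_approx as a window aggregate over the whole dataframe;
# accuracy=10 matches the SQL expression above.
df2 = df.withColumn('count_median',
                    F.percentile_approx('count', 0.5, 10).over(Window.partitionBy()))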
Upvotes: 4