Samiksha

Reputation: 139

Compute median of column in PySpark

I have a dataframe as shown below:

+-----------+------------+
|parsed_date|       count|
+-----------+------------+
| 2017-12-16|           2|
| 2017-12-16|           2|
| 2017-12-17|           2|
| 2017-12-17|           2|
| 2017-12-18|           1|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-19|           4|
| 2017-12-20|           1|
+-----------+------------+

I want to compute the median of the entire 'count' column and add the result to a new column.

I tried:

median = df.approxQuantile('count', [0.5], 0.1).alias('count_median')

But of course I am doing something wrong as it gives the following error:

AttributeError: 'list' object has no attribute 'alias'

Upvotes: 3

Views: 9471

Answers (2)

ZygD

Reputation: 24386

Spark 3.4+ has a median function for calculating the exact median (suitable for small/mid-sized dataframes).

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('2017-12-16', 2),
     ('2017-12-16', 2),
     ('2017-12-17', 2),
     ('2017-12-17', 2),
     ('2017-12-18', 1),
     ('2017-12-19', 4),
     ('2017-12-19', 4),
     ('2017-12-19', 4),
     ('2017-12-19', 4),
     ('2017-12-20', 1)],
    ['parsed_date', 'count'])

df.withColumn('median_of_count', F.expr('median(count) over()')).show()
# +-----------+-----+---------------+
# |parsed_date|count|median_of_count|
# +-----------+-----+---------------+
# | 2017-12-16|    2|            2.0|
# | 2017-12-16|    2|            2.0|
# | 2017-12-17|    2|            2.0|
# | 2017-12-17|    2|            2.0|
# | 2017-12-18|    1|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-19|    4|            2.0|
# | 2017-12-20|    1|            2.0|
# +-----------+-----+---------------+
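
If you prefer the column API over a SQL expression, the same result should be obtainable with F.median over an unpartitioned window (a minimal sketch, still assuming Spark 3.4+):

from pyspark.sql import functions as F
from pyspark.sql import Window

# an empty partitionBy() makes the window span the whole dataframe
w = Window.partitionBy()
df.withColumn('median_of_count', F.median('count').over(w)).show()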

Upvotes: 0

mck

Reputation: 42352

You need to add a column with withColumn because approxQuantile returns a list of floats, not a Spark column.

import pyspark.sql.functions as F

df2 = df.withColumn('count_median', F.lit(df.approxQuantile('count', [0.5], 0.1)[0]))

df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|         2.0|
| 2017-12-16|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-17|    2|         2.0|
| 2017-12-18|    1|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-19|    4|         2.0|
| 2017-12-20|    1|         2.0|
+-----------+-----+------------+
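
As a side note, the third argument to approxQuantile is the relative error; per its documentation, passing 0 computes the exact quantile (which can be expensive on large data). A minimal sketch:

# relativeError=0 asks for the exact quantile instead of an approximation
exact_median = df.approxQuantile('count', [0.5], 0)[0]
df2 = df.withColumn('count_median', F.lit(exact_median))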

You can also use the approx_percentile / percentile_approx function in Spark SQL:

import pyspark.sql.functions as F

df2 = df.withColumn('count_median', F.expr("approx_percentile(count, 0.5, 10) over ()"))

df2.show()
+-----------+-----+------------+
|parsed_date|count|count_median|
+-----------+-----+------------+
| 2017-12-16|    2|           2|
| 2017-12-16|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-17|    2|           2|
| 2017-12-18|    1|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-19|    4|           2|
| 2017-12-20|    1|           2|
+-----------+-----+------------+
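
Since Spark 3.1 there is also a percentile_approx function in the DataFrame API; a sketch of the same idea (the accuracy argument is optional and defaults to 10000):

from pyspark.sql import functions as F
from pyspark.sql import Window

# percentile_approx is an aggregate, so it can also be evaluated over a window
df2 = df.withColumn('count_median',
                    F.percentile_approx('count', 0.5).over(Window.partitionBy()))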

Upvotes: 4
