Reputation: 363
I have the following code in pyspark, resulting in a table showing me the different values for a column and their counts. I want to have another column showing what percentage of the total count does each row represent. How do I do that?
from pyspark.sql.functions import desc

difrgns = (
    df1
    .groupBy("column_name")
    .count()
    .sort(desc("count"))
    .show()
)
Upvotes: 16
Views: 38831
Reputation: 9337
When df itself is a more complex transformation chain and running it twice -- first to compute the total count and then to group and compute percentages -- is too expensive, it's possible to leverage a window function to achieve similar results. Here's a more generalized version (extending bluephantom's answer) that can be used with any number of group-by dimensions:
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

data = [
    ("A", "X", 2, 100),
    ("A", "X", 7, 100),
    ("B", "X", 10, 100),
    ("C", "X", 1, 100),
    ("D", "X", 50, 100),
    ("E", "X", 30, 100),
]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))
df = (
    sqlContext
    .createDataFrame(someschema)
    .withColumn(
        "total_count",
        F.count("*").over(Window.partitionBy(<your N-1 dimensions here>))
    )
    .groupBy(<your N dimensions here>)
    .agg(
        (F.count("*") / F.first(F.col("total_count"))).alias("percent_total")
    )
)
df.show()
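For instance, with the sample data above and c1, c2 as the N group-by dimensions (so the N-1 partition is c2 alone), the placeholders would be filled in roughly like this (a sketch; the column choices are just for illustration):

# Sketch: placeholders filled in for the sample data, grouping by (c1, c2)
# and partitioning the total-count window by c2 alone
df = (
    sqlContext
    .createDataFrame(someschema)
    .withColumn(
        "total_count",
        F.count("*").over(Window.partitionBy("c2"))
    )
    .groupBy("c1", "c2")
    .agg(
        (F.count("*") / F.first(F.col("total_count"))).alias("percent_total")
    )
)
df.show()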
Upvotes: 7
Reputation: 2822
More "beatified" output, eliminating the excess decimals and sort it.
import pyspark.sql.functions as F

count_cl = data_fr.count()
(
    data_fr
    .groupBy("col_name")
    .count()
    .withColumn("%", F.round((F.col("count") / count_cl) * 100, 2))
    .orderBy("count", ascending=False)
    .show(4, False)
)
+--------------+-----+----+
|      col_name|count|   %|
+--------------+-----+----+
|       C.LQQQQ|30957|8.91|
|       C.LQQQQ|29688|8.54|
|       C-LQQQQ|29625|8.52|
|        CLQQQQ|29342|8.44|
+--------------+-----+----+
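If you also want a literal percent sign in the output, one option (a sketch, not part of the original answer; it reuses the data_fr and count_cl names from above) is to concatenate it onto the rounded value:

# Sketch: render the rounded percentage as a string with a trailing "%" sign
(
    data_fr
    .groupBy("col_name")
    .count()
    .withColumn("%", F.concat(F.round((F.col("count") / count_cl) * 100, 2).cast("string"), F.lit("%")))
    .orderBy("count", ascending=False)
    .show(4, False)
)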
Upvotes: 5
Reputation: 18013
An example as an alternative if you are not comfortable with windowing, which, as the comment alludes to, is the better way to go:
# Running in Databricks, not all stuff required
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
data = [
    ("A", "X", 2, 100),
    ("A", "X", 7, 100),
    ("B", "X", 10, 100),
    ("C", "X", 1, 100),
    ("D", "X", 50, 100),
    ("E", "X", 30, 100),
]
rdd = sc.parallelize(data)
schema = rdd.map(lambda x: Row(c1=x[0], c2=x[1], val1=int(x[2]), val2=int(x[3])))
df = sqlContext.createDataFrame(schema)
tot = df.count()
(
    df
    .groupBy("c1")
    .count()
    .withColumnRenamed('count', 'cnt_per_group')
    .withColumn('perc_of_count_total', (F.col('cnt_per_group') / tot) * 100)
    .show()
)
Which returns:
+---+-------------+-------------------+
| c1|cnt_per_group|perc_of_count_total|
+---+-------------+-------------------+
|  E|            1| 16.666666666666664|
|  B|            1| 16.666666666666664|
|  D|            1| 16.666666666666664|
|  C|            1| 16.666666666666664|
|  A|            2|  33.33333333333333|
+---+-------------+-------------------+
I focus on Scala and it seems easier with that. That said, the suggested solution via the comments uses Window, which is what I would do in Scala with over().
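For reference, a minimal PySpark sketch of that Window/over() idea, assuming the same df as above; it avoids the separate df.count() action by summing the group counts over an unpartitioned window:

from pyspark.sql.window import Window

# Sketch: percent of total via a window, so no separate df.count() is needed.
# Note: an empty partitionBy() moves all rows into a single partition.
w = Window.partitionBy()
(
    df
    .groupBy("c1")
    .count()
    .withColumnRenamed("count", "cnt_per_group")
    .withColumn("perc_of_count_total", F.col("cnt_per_group") / F.sum("cnt_per_group").over(w) * 100)
    .show()
)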
Upvotes: 23
Reputation: 17804
You can groupby and aggregate with agg. For example, for the following DataFrame:
+--------+-----+
|category|value|
+--------+-----+
|       a|    1|
|       b|    2|
|       a|    3|
+--------+-----+
You can use:
import pyspark.sql.functions as F
(
    df
    .groupby("category")
    .agg(
        F.count("value").alias("count"),
        (F.count("value") / df.count()).alias("percentage")
    )
    .show()
)
Output:
+--------+-----+------------------+
|category|count|        percentage|
+--------+-----+------------------+
|       b|    1|0.3333333333333333|
|       a|    2|0.6666666666666666|
+--------+-----+------------------+
Alternatively, you can use SQL:
df.createOrReplaceTempView("df")
(
    spark
    .sql(
        """
        SELECT category,
               COUNT(*) AS count,
               COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
        FROM df
        GROUP BY category
        """
    )
    .show()
)
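If you want the result as a rounded percentage rather than a ratio, as the question asks, the same aggregation can be wrapped in round; a short sketch using the DataFrame above:

# Sketch: express the share as a percentage rounded to two decimals
(
    df
    .groupby("category")
    .agg(
        F.count("value").alias("count"),
        F.round(F.count("value") / df.count() * 100, 2).alias("percentage")
    )
    .show()
)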
Upvotes: 6