Reputation: 883
I'm using the following code to aggregate students per year. The goal is to get the total number of students for each year.
from pyspark.sql.functions import col
import pyspark.sql.functions as fn

gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))
The problem I discovered is that many IDs are repeated, so the result is wrong and inflated. I want to aggregate the students by year and count the total number of students per year while counting each ID only once.
Upvotes: 67
Views: 168410
Reputation: 1164
By using Spark/PySpark SQL (note the trailing space in the first string fragment; without it the two fragments concatenate into "countFROM" and the query fails):
y.createOrReplaceTempView("STUDENT")
spark.sql("SELECT year, count(DISTINCT id) as count " + \
    "FROM STUDENT group by year").show()
Upvotes: 0
Reputation: 3197
If you are working with an older Spark version that doesn't have the countDistinct function, you can replicate it using a combination of the size and collect_set functions like so:
df_grouped = y.groupBy("year").agg(fn.size(fn.collect_set("id")).alias("distinct_count"))
If you have to count distinct values over multiple columns, simply concatenate the columns into a new one using concat and apply the same aggregation, as sketched below.
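A minimal sketch of the multi-column case, assuming a hypothetical DataFrame df with placeholder columns year, first_name and last_name (none of these names come from the original answer):

import pyspark.sql.functions as fn

# Concatenate the columns that jointly identify a student, then count
# the distinct concatenated values per year.
df_multi = (
    df.withColumn("combined", fn.concat("first_name", "last_name"))
      .groupBy("year")
      .agg(fn.size(fn.collect_set("combined")).alias("distinct_count"))
)

Note that concat returns null if any input column is null; concat_ws("|", ...) avoids that and adds a separator so values like ("ab", "c") and ("a", "bc") don't collide.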
Upvotes: 1
Reputation: 131
countDistinct() and multiple aggregations are both unsupported in Spark Structured Streaming.
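If an approximate count is acceptable, approx_count_distinct is one commonly suggested alternative under streaming's restrictions. A minimal sketch, assuming the y DataFrame from the accepted answer below:

import pyspark.sql.functions as fn

# HyperLogLog-based estimate; rsd caps the relative standard deviation.
approx = y.groupBy("year").agg(
    fn.approx_count_distinct("id", rsd=0.05).alias("approx_students")
)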
Upvotes: 1
Reputation: 3118
You can also do:
y.groupBy("year", "id").count().groupBy("year").count()
The inner groupBy collapses duplicate (year, id) pairs into one row each; the outer count then returns the number of unique students per year.
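A minimal usage sketch on the sample data from the accepted answer below; withColumnRenamed is only there to give the final column a clearer name than the default count:

distinct_per_year = (
    y.groupBy("year", "id").count()   # one row per distinct (year, id) pair
     .groupBy("year").count()         # rows per year == unique students per year
     .withColumnRenamed("count", "unique_students")
)
distinct_per_year.show()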
Upvotes: 6
Reputation: 4291
Use the countDistinct function:
from pyspark.sql.functions import countDistinct
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])
gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()
Output:
+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002| 2|
|2001| 2|
+----+------------------+
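If you want the column name the question asked for, a small variation (not in the original answer) is to alias the aggregate:

gr = y.groupBy("year").agg(countDistinct("id").alias("total_student_by_year"))
gr.show()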
Upvotes: 141