Reputation: 1996
I have the below script (I've removed all the column names etc. to make it easier to see what I am doing at a high level - it was very messy!)
I need to add a column that is the equivalent of count(*) in SQL.
So if I have grouped user usage by domain I might see the below - where the count is the number of records that match all the prior column conditions.
domain.co.uk/       UK User    32433
domain.co.uk/home   EU User    43464
etc...
I'm sure it's been asked somewhere on Stack Overflow before, but I've had a good look around and can't find any reference to it!
vpx_cont_filter = vpx_data\
    .coalesce(1000)\
    .join(....)\
    .select(....)\
    .groupBy(....)\
    .agg(
        ....
    )\
    .select(....)
Upvotes: 1
Views: 2940
Reputation: 1553
Do you mean that in your agg, you want to add a column that counts all occurrences for each groupBy?
You can then add this:
.agg(
    F.count(F.lit(1)).alias("total_count"),
    ...
)
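Put together, a minimal runnable sketch might look like this (the domain / user_type column names and the toy data are made up to stand in for yours):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for your joined/selected columns (names are invented)
data = [
    ("domain.co.uk/", "UK User"),
    ("domain.co.uk/", "UK User"),
    ("domain.co.uk/home", "EU User"),
]
df = spark.createDataFrame(data, ["domain", "user_type"])

# F.count(F.lit(1)) counts every row in each group, like COUNT(*) in SQL
result = df.groupBy("domain", "user_type")\
    .agg(F.count(F.lit(1)).alias("total_count"))

result.show()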
By the way, I don't think you're forced to use F.lit(1). In the Spark source code, they have a match case for when you specify the star instead of F.lit(1):
// Turn count(*) into count(1)
case s: Star => Count(Literal(1))
case _ => Count(e.expr)
So F.count("*") would also work, I think.
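If you want to verify that both forms agree, here is a quick self-contained check (column name and data are invented):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["k"])

# COUNT(*) and COUNT(1) should produce the same numbers for every group
df.groupBy("k").agg(
    F.count("*").alias("star_count"),
    F.count(F.lit(1)).alias("lit_count"),
).show()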
PS: I'm using F. because I assumed you imported the functions package like this:
from pyspark.sql import functions as F
Upvotes: 5