kikee1222

Reputation: 1996

PySpark - adding a column to count(*)

I have the script below (I've removed all the column names etc. to make it easier to see what I'm doing at a high level; it was very messy!).

I need to add a column that is the equivalent of count(*) in SQL.

So if I have grouped user usage by domain, I might see something like the below, where the count is the number of records that match all the prior column conditions.

domain.co.uk/       UK User   32433
domain.co.uk/home   EU User   43464
etc...

I'm sure it's been asked somewhere on Stack Overflow before, but I've had a good look around and can't find any reference to it!

vpx_cont_filter = vpx_data\
        .coalesce(1000)\
        .join(....)\
        .select(....)\
        .groupBy(....)\
        .agg(
           ....
            )\
        .select(....)

Upvotes: 1

Views: 2940

Answers (1)

tricky

Reputation: 1553

Do you mean that in your agg you want to add a column that counts all occurrences for each groupBy?

Then you can add this:

.agg(
  F.count(F.lit(1)).alias("total_count"),
  ...
)
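
For context, here is a minimal self-contained sketch of the whole pattern. The column names (domain, region, bytes) and the sum aggregate are invented for illustration; substitute your own:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for your joined/selected DataFrame
df = spark.createDataFrame(
    [
        ("domain.co.uk/", "UK User", 120),
        ("domain.co.uk/", "UK User", 80),
        ("domain.co.uk/home", "EU User", 300),
    ],
    ["domain", "region", "bytes"],
)

result = (
    df.groupBy("domain", "region")
      .agg(
          F.count(F.lit(1)).alias("total_count"),  # the count(*) equivalent
          F.sum("bytes").alias("total_bytes"),     # other aggregates go alongside it
      )
)
result.show()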

By the way, I don't think you're forced to use F.lit(1). In the Spark source code, they have a match case for when you specify the star instead of F.lit(1):

// Turn count(*) into count(1)
case s: Star => Count(Literal(1))
case _ => Count(e.expr)

So F.count("*") would also work, I think.
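
If you want to sanity-check that, both forms give the same result on the toy df above:

df.groupBy("domain", "region")\
    .agg(F.count("*").alias("total_count"))\
    .show()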

PS: I'm using F. because I assumed you imported the functions package like this:

from pyspark.sql import functions as F

Upvotes: 5
