user1783504

Reputation: 453

Pyspark - cube aggregation

I'm trying to use the cube function in pyspark without including all the columns in the cube.

SQL equivalent of what I am trying to achieve:

select col1, col2, col3, sum(col4) from table group by col1, cube(col2, col3)

This groups by col1 and all combinations of col2 & col3.

In pyspark, I get the error 'GroupedData' object has no attribute 'cube' when running the below:

spark.table("table").groupBy(col1).cube(col2,col3).agg(sum(col4))

I'm able to use cube, but only if I include col1, which I don't want to do:

spark.table("table").cube(col1,col2,col3).agg(sum(col4))

Upvotes: 1

Views: 49

Answers (1)

werner

Reputation: 14845

There are two options:

  1. Use SQL directly:
spark.sql("""
    select col1, col2, col3, sum(col4)
    from table
    group by col1, cube(col2, col3)
""").show()
  2. Use the DataFrame API and filter out the additional dimension (a variant that also keeps genuine nulls in col1 is sketched after the code):
from pyspark.sql import functions as F

spark.table("data") \
    .cube("col1", "col2", "col3") \
    .agg(F.sum("col4")) \
    .where(F.col("col1").isNotNull()) \
    .show()
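Note that the isNotNull filter also drops rows where col1 is genuinely null in the source data, since the cube represents the rolled-up col1 dimension as null as well. If those rows matter, a minimal sketch (assuming Spark 2.0+, where pyspark.sql.functions.grouping is available) keeps only the grouping sets in which col1 was not rolled up:

from pyspark.sql import functions as F

# grouping("col1") is 1 for rows where the cube rolled col1 up, 0 otherwise,
# so filtering on it preserves rows whose col1 value is actually null.
spark.table("table") \
    .cube("col1", "col2", "col3") \
    .agg(F.sum("col4").alias("sum_col4"),
         F.grouping("col1").alias("col1_grouping")) \
    .where(F.col("col1_grouping") == 0) \
    .drop("col1_grouping") \
    .show()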

Upvotes: 0
