Jitesh Sharma

Reputation: 75

Condition while aggregating in Spark

This question is about conditional aggregation in SQL. Normally we add conditions with a 'case' expression in the select clause, but such a condition only checks the row under consideration. Consider the data below:

BEGIN TRANSACTION;

/* Create a table called NAMES */
CREATE TABLE NAMES(M CHAR, D CHAR, A INTEGER);

/* Create few records in this table */
INSERT INTO NAMES VALUES('M1','Y',2);
INSERT INTO NAMES VALUES('M1','Y',3);
INSERT INTO NAMES VALUES('M2','Y',2);
INSERT INTO NAMES VALUES('M2',null,3);
INSERT INTO NAMES VALUES('M3',null,2);
INSERT INTO NAMES VALUES('M3',null,3);
COMMIT;
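
For anyone trying this in Spark directly, the same data can be registered as a temporary view. This is only a sketch, assuming a spark-shell session where the SparkSession is available as 'spark':

import spark.implicits._

// Rebuild the NAMES table as a DataFrame; D is an Option so the
// null rows from the inserts above carry over as SQL NULLs.
val names = Seq(
  ("M1", Some("Y"), 2),
  ("M1", Some("Y"), 3),
  ("M2", Some("Y"), 2),
  ("M2", None,      3),
  ("M3", None,      2),
  ("M3", None,      3)
).toDF("M", "D", "A")

// Expose it to Spark SQL under the same name as the table above.
names.createOrReplaceTempView("NAMES")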

This query groups by column 'M', checks each record individually for whether column 'D' is 'Y' (its only non-null value here), and applies a sum aggregation to column 'A':

select M, sum(case when D = 'Y' then 0 else A end) from NAMES group by M;

Output for this query is:

M1|0
M2|3
M3|5

But what if we want to check column 'D' across all records in the group: if any record in the group has 'D' = 'Y', do not perform the 'sum' aggregation at all and return 0 instead. In short, the expected output for the above data is:

M1|0
M2|0
M3|5

Answers in Spark SQL are highly appreciated.

Upvotes: 2

Views: 820

Answers (1)

Gordon Linoff

Reputation: 1270443

You can use another case expression:

select M,
       (case when max(D) = 'Y'  -- at least one row in the group has 'Y'
             then 0
             else sum(A)
        end)
from NAMES
group by M;
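
This works because max(D) ignores nulls: it equals 'Y' exactly when some row in the group carries 'Y', and for an all-null group the comparison itself is null, so the else branch sums 'A'. The query above runs unchanged in Spark SQL via spark.sql(...). For completeness, a sketch of the equivalent DataFrame API aggregation, assuming the 'names' DataFrame built earlier in the question:

import org.apache.spark.sql.functions.{col, max, sum, when}

// Return 0 for a group when any of its rows has D = 'Y';
// otherwise sum column A across the group.
val result = names
  .groupBy("M")
  .agg(when(max(col("D")) === "Y", 0).otherwise(sum(col("A"))).as("total"))

result.orderBy("M").show()
// M1 -> 0, M2 -> 0, M3 -> 5, matching the expected output.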

Upvotes: 1
