Reputation: 45
Given the two tables below, for each datapoint, I want to count the number of distinct years for which we have a value. However, Spark SQL does not allow combining COUNT DISTINCT and FILTER.
CREATE TABLE datapoints (name STRING);
INSERT INTO
datapoints
VALUES
('Name'),
('Height'),
('Color');
CREATE TABLE entities (datapoint STRING, year INT, value STRING);
INSERT INTO
entities
VALUES
('Name', 2015, 'John'),
('Name', 2015, 'Suzan'),
('Name', 2017, 'Jim'),
('Color', 2015, 'Blue')
SELECT
dp.name,
COUNT(DISTINCT year) FILTER (
WHERE
value IS NOT NULL
) as DPCount
FROM
datapoints as dp
LEFT JOIN entities on datapoint = dp.name
GROUP BY
dp.name
Results in:
Error in SQL statement: AnalysisException: DISTINCT and FILTER cannot be used in aggregate functions at the same time; line 3 pos 2
What would be the functionally equivalent valid Spark SQL statement? The expected output is (notice the duplicate year for 'Name'):
name | DPCount |
---|---|
Color | 1 |
Height | 0 |
Name | 2 |
Upvotes: 1
Views: 1375
Reputation: 42422
Try doing count distinct on a case when:
SELECT
dp.name,
COUNT(DISTINCT case when value is not null then year end) as DPCount
FROM
datapoints as dp
LEFT JOIN entities on datapoint = dp.name
GROUP BY
dp.name
Upvotes: 2