spalfy
spalfy

Reputation: 45

Combining COUNT DISTINCT with FILTER - Spark SQL

Given the two tables below, for each datapoint, I want to count the number of distinct years for which we have a value. However, Spark SQL does not allow combining COUNT DISTINCT and FILTER.

CREATE TABLE datapoints (name STRING);

INSERT INTO
  datapoints
VALUES
  ('Name'),
  ('Height'),
  ('Color');

CREATE TABLE entities (datapoint STRING, year INT, value STRING);

INSERT INTO
  entities
VALUES
  ('Name', 2015, 'John'),
  ('Name', 2015, 'Suzan'),
  ('Name', 2017, 'Jim'),
  ('Color', 2015, 'Blue')

SELECT
  dp.name,
  COUNT(DISTINCT year) FILTER (
    WHERE
      value IS NOT NULL
  ) as DPCount  
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name

Results in:

Error in SQL statement: AnalysisException: DISTINCT and FILTER cannot be used in aggregate functions at the same time; line 3 pos 2

What would be the functionally equivalent valid Spark SQL statement? The expected output is (notice the duplicate year for 'Name'):

name DPCount
Color 1
Height 0
Name 2

Upvotes: 1

Views: 1375

Answers (1)

mck
mck

Reputation: 42422

Try doing count distinct on a case when:

SELECT
  dp.name,
  COUNT(DISTINCT case when value is not null then year end) as DPCount  
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name

Upvotes: 2

Related Questions