JMordijck
JMordijck

Reputation: 73

Hive query drops null values with "not column=value" where clause

table1 data sample:

year month day utmsource
2017 03    26  NULL
2017 03    27  NULL
2017 03    27  facebook
2017 03    27  newsletter
2017 03    27  banner
2017 03    27  facebook    

expected selection:

year month day utmsource
2017 03    27  NULL
2017 03    27  newsletter
2017 03    27  banner 

My Hive queries:

-- result = 0, it did not include the NULL utmsource record
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource="facebook"

-- result = 1 the NULL utmsource record is included
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND (utmsource IS NULL OR NOT utmsource="facebook")

-- also returns 0, the NULL utmsource record is not included
SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM table1
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

Questions:

  1. Could someone explain this behaviour?
  2. Can i change a setting to retrieve the result of query 2 without adding the extra OR functionality in my query? => not equals includes null values in result

Upvotes: 1

Views: 1844

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269503

What you want is a NULL-safe equality (or inequality) operator. In ANSI SQL, there is an operator called is distinct from. Hive appears to use the MySQL version, which is <=>. So, you can do:

SELECT SUM(CASE WHEN utmsource IS NULL THEN 1 ELSE 0 END) as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

This operator is described in the documentation.

I should also point out that you might find this a simpler formulation for the SELECT:

SELECT (COUNT(*) - COUNT(utmsource)) as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND NOT utmsource <=> 'facebook';

Although, overall, this would appear to be the simplest:

SELECT COUNT(*)as amountnull
FROM tablename
WHERE year=2017 AND month=03 AND day=27 AND utmsource IS NULL;

The comparison to 'Facebook' is unnecessary.

Upvotes: 4

Related Questions