abhay
abhay

Reputation: 113

hive queries is giving wrong result for a condition is not null with many or conditions

I need to exculde all the rows having null in few specified column in hive managed table. when is use "col is not null" or "not isdbnull(col)" with one or two columns it worked fine. But i need to check many col, So when add more or conditions in query, it ignores null condition and gives all rows.

I decide to understand the cause, I reach at conclusion that if all the columns having null same time will give right select result. if any of the isdbnull(col) condition fails will include all rows also which is still having nulls and specified in query with or condition.

Any clue much appreciated.

Upvotes: 0

Views: 7740

Answers (1)

invoketheshell
invoketheshell

Reputation: 3897

You mentioned you used "or" instead of "and" in your query. So you did "(not A) or (Not B)" which is equivalent to "not (A and B)". This will require both to be null. This is different than "not (A or B)" which is the same as "(not A) and (not B)" which is how I wrote the query below. See De Morgans laws for a further explanation.

If you want to select all rows that have non nulls then do this:

 select col1, col2, col3 from table
 where col1 is not null and col2 is not null and col3 is not null;

Additionally if you constitute an empty string as a null value then you can:

Select col1 .... where col1 != '';

I have seen people also do:

Select col1 .... where length(col1) > 0;

How does Hive understand nulls? An empty string is interpreted as empty by Hive, not as NULL. An empty string could be have a different meaning to an application than a NULL so they are interpreted differently.

When you load data the default Missing values are represented by the special value NULL. To import data with NULL fields, check documentation of the SerDe used by the table. The default Text Format uses LazySimpleSerDe which interprets the string \N as NULL when importing. This means you should have \N as values to represent nulls when loading hive.

You can modify this ("serialization.null.format"="") when creating a table to let hive know you have some other value to represent null. In the case here you can see it was set to "" for nulls.

Good luck!

Upvotes: 1

Related Questions