HP.
HP.

Reputation: 19896

Hive multiple subqueries in WHERE predicate workaround

I have three tables and I would like to query table3 using conditions based on table1 and table2. Here is the simplified version of the data and queries:

CREATE TABLE table1 (
  id int
);

INSERT INTO table1 VALUES(1);
INSERT INTO table1 VALUES(2);
INSERT INTO table1 VALUES(3);

+------------+--+
| table1.id  |
+------------+--+
| 1          |
| 2          |
| 3          |
+------------+--+

CREATE TABLE table2 (
  code varchar(10)
);

INSERT INTO table2 VALUES('a');
INSERT INTO table2 VALUES('b');
INSERT INTO table2 VALUES('c');

+--------------+--+
| table2.code  |
+--------------+--+
| a            |
| b            |
| c            |
+--------------+--+

CREATE TABLE table3 (
  id int,
  code varchar(10)
);

INSERT INTO table3 VALUES(1,'d');
INSERT INTO table3 VALUES(1,'a');
INSERT INTO table3 VALUES(2,'b');
INSERT INTO table3 VALUES(2,'e');
INSERT INTO table3 VALUES(4,'a');
INSERT INTO table3 VALUES(4,'d');

+------------+--------------+--+
| table3.id  | table3.code  |
+------------+--------------+--+
| 1          | d            |
| 1          | a            |
| 2          | b            |
| 2          | e            |
| 4          | a            |
| 4          | d            |
+------------+--------------+--+

Basically, I like to get the records from table3 only if id exists in table1 and code not exists in table2. So the result should be just

1,d
2,e

This below query doesn't work:

SELECT * FROM table3 WHERE (table3.id IN (SELECT table1.id FROM
table1)) AND NOT (table3.code IN (SELECT table2.code FROM table2));

I got this error:

Error: Error while compiling statement: FAILED: SemanticException [Error 10249]: Line 1:94 Unsupported SubQuery Expression 'code': Only 1 SubQuery expression is supported. (state=42000,code=10249)

Independently, each condition works fine:

SELECT * FROM table3 WHERE (table3.id IN (SELECT table1.id FROM table1));

+------------+--------------+--+
| table3.id  | table3.code  |
+------------+--------------+--+
| 1          | d            |
| 1          | a            |
| 2          | b            |
| 2          | e            |
+------------+--------------+--+

SELECT * FROM table3 WHERE NOT (table3.code IN (SELECT table2.code FROM table2));

+------------+--------------+--+
| table3.id  | table3.code  |
+------------+--------------+--+
| 1          | d            |
| 2          | e            |
| 4          | d            |
+------------+--------------+--+

IMPORTANT: I cannot do JOIN or modify anything in FROM as this is a part of a reporting system so the only thing I can do is to tweak WHERE clause.

Upvotes: 4

Views: 8956

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269463

NOT IN is a simpler way to write your query:

SELECT t3.*
FROM table3 t3
WHERE t3.id IN (SELECT table1.id FROM table1) AND
      t3.code NOT IN (SELECT table2.code FROM table2);

If you are limited to one subquery, this gets tricky. Here is one possibility, although I'm not sure if hive will accept it:

where exists (select 1
              from table1 t1
              where t1.id = t3.id and
                    not exists (select 1
                                from table2 t2
                                where t2.code = t3.code
                               )
             )

You can do this without the double nesting:

where exists (select 1
              from table1 t1 left join
                   table2 t2
                   on t2.code = t3.code
              where t1.id = t3.id
             )

Upvotes: 0

Lukasz Szozda
Lukasz Szozda

Reputation: 175566

You could rewrite it using JOINs:

SELECT DISTINCT t3.*
FROM table3 t3
JOIN table1 t1
  ON t3.id = t1.id
LEFT JOIN table2 t2
  ON t2.code = t3.code
WHERE t2.code IS NULL;

DBFiddle Demo


the only thing I can do is to tweak WHERE clause.

SELECT *
FROM Table3 t
WHERE EXISTS (SELECT 1
              FROM table3 t3
              JOIN table1 t1
                ON t3.id = t1.id
              LEFT JOIN table2 t2
                ON t2.code = t3.code
             WHERE t2.code IS NULL
               AND t3.id = t.id
               AND t3.code = t.code)

DBFiddle Demo2

Upvotes: 2

Mureinik
Mureinik

Reputation: 311018

One dirty trick you can use is to cross join table1 and table2, since you don't care about their relationship anyway, and use an exists condition:

SELECT *
FROM   table3 
WHERE  NOT EXISTS (SELECT     *
                   FROM       table1
                   CROSS JOIN table2
                   WHERE      table3.id = table1.id ON table3.code = table2.code)

EDIT:
While the above query should work, its performance probably won't be great. A slightly faster variant would be to use union all in the subquery:

SELECT *
FROM   table3 
WHERE  NOT EXISTS (SELECT     *
                   FROM       table1
                   WHERE      table3.id = table1.id 
                   UNION ALL
                   SELECT     *
                   FROM       table2
                   WHERE      table3.code = table2.code)

Upvotes: 1

Related Questions