Compare one value of column A with all the values of column B in Hive HQL

Question

I have two columns in one table say Column A and Column B. I need to search each value of Column A with All the values of column B each and every time and return true if the column A value is found in any of the rows of column B. How can i get this?

I have tried using the below command:

select column _A, column_B,(if (column_A =column_B), True, False) as test from sample;

If i use the above command, it is checking for that particular row alone. But I need true value, if a value of column A is found in any of the rows of column B. How can i can check one value of column A with all the all the values of column B?

Or Is there any possibility to iterate and compare each value between two columns?

sujit · Accepted Answer

Solution

create temporary table t as select rand() as id, column_A, column_B from sample; --> Refer 1

select distinct t3.id,t3.column_A,t3.column_B,t3.match from ( --> Refer 3
select t1.id as id, t1.column_A as column_A, t1.column_B as column_B,--> Refer 2 
if(t2.column_B is null, False, True) as match  from t t1 LEFT OUTER JOIN
   t t2  ON t1.column_A = t2.column_B 
) t3;

Explanation

Create an identifier column to keep track of the rows in original table. I am using rand() here. We will take advantage of this to get the original rows in Step 3. Creating a temporary table t here for simplicity in next steps.
Use a LEFT OUTER JOIN with self to do your test that requires matching each column with another across all rows, yielding the match column. Note that here multiple duplicate rows may get created than in Sample table, but we have got a handle on the duplicates, since the id column for them will be same.
In this step, we apply distinct to get the original rows as in Sample table. You can then ditch the id column.

Notes

Self joins are costly in terms of performance, but this is unavoidable for solution to the question.
The distinct used in Step 3, is costly too. A more performant approach would be to use Window functions where we can partition by the id and pick the first row in the window. You can explore that.

Compare one value of column A with all the values of column B in Hive HQL

Answers (2)

Solution

Explanation

Notes

Related Questions