Reputation: 3667
This is type of query I am sending to HIVE:
SELECT BigTable.nicefield,LargeTable.*
FROM LargeTable INNER JOIN BigTable
ON (
LargeTable.joinfield1of4 = BigTable.joinfield1of4
AND LargeTable.joinfield2of4 = BigTable.joinfield2of4
)
WHERE LargeTable.joinfield3of4=20140726 AND LargeTable.joinfield4of4=15 AND BigTable.joinfield3of4=20140726 AND BigTable.joinfield4of4=15
AND LargeTable.filterfiled1of2=123456
AND LargeTable.filterfiled2of2=98765
AND LargeTable.joinfield2of4=12
AND LargeTable.joinfield1of4='iwanttolikehive'
It returns 2418025
rows. The issue is that
SELECT *
FROM LargeTable
WHERE joinfield3of4=20140726 AND joinfield4of4=15
AND filterfiled1of2=123456
AND filterfiled2of2=98765
AND joinfield2of4=12
AND joinfield1of4='iwanttolikehive'
returns 1555
rows, and so does:
SELECT *
FROM BigTable
WHERE joinfield3of4=20140726 AND joinfield4of4=15
AND joinfield2of4=12
AND joinfield1of4='iwanttolikehive'
Note that 1555^2 = 2418025.
Upvotes: 3
Views: 1882
Reputation: 3667
It turns out that the correct version of the query should be:
SELECT bt.nicefield,LargeTable.*
FROM LargeTable INNER JOIN
(
SELECT nicefield, joinfield1of4,joinfield2of4, count(*) as rows
FROM BigTable
WHERE joinfield3of4=20140726 ANDjoinfield4of4=15
GROUP BY nicefield, joinfield1of4,joinfield2of4
) bt
ON (
LargeTable.joinfield1of4 = bt.joinfield1of4
AND LargeTable.joinfield2of4 = bt.joinfield2of4
)
WHERE LargeTable.joinfield3of4=20140726 AND LargeTable.joinfield4of4=15
AND LargeTable.filterfiled1of2=123456
AND LargeTable.filterfiled2of2=98765
AND LargeTable.joinfield2of4=12
AND LargeTable.joinfield1of4='iwanttolikehive'
The issue is that in the original query, the join on the BigTable
was returning duplicates.
This is NOT an issue, queries just have to be written carefully!!! I hope this help!
Upvotes: 2