Reputation: 333
I am running Apache Hadoop 2.6.0 on Ubuntu 14.0 and I have a table created in Hive 0.13.0 as:
CREATE TABLE IF NOT EXISTS recipes_hive.cuisine (
ID INT COMMENT 'Cuisine ID.',
name STRING COMMENT 'Cusine name - primary key.',
area STRING COMMENT 'Name of the area of origin - foreign key.',
scope STRING COMMENT 'Either country or area.')
COMMENT 'Table containing cuisines data.'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
And I fill it with data with statement:
LOAD DATA LOCAL INPATH 'path_to_file/CUISINE.csv'
OVERWRITE INTO TABLE recipes_hive.cuisine;
My database has several such tables all created and filled with same procedure. When running simple queries like:
SELECT * FROM cuisine
or even with certain conditions in WHERE clause I get the results as expected but running more complex queries I get squat. For example:
SELECT cuisine.name, SUM(IF (ingredient.category = "fruit",1,2))/count(*) AS PERC
FROM cuisine JOIN recipe ON recipe.cuisine = cuisine.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
GROUP BY cuisine.name
ORDER BY PERC DESC
, or:
SELECT ingredient.id, ingredient.name
FROM cuisine JOIN recipe ON recipe.cuisine = cuisine.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
WHERE ingredient.id IN (
SELECT ingredient.id
FROM cuisine c JOIN recipe ON recipe.cuisine = c.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
WHERE c.name = "Pakistan") AND cuisine.name = "Bangladesh"
The first example calculates some percentage and the second one checks for mutual elements.
The MapReduce and Hadoop get invoked properly and they return no error. The output ends with:
Execution completed successfully
MapredLocal task succeeded
OK
Time taken: 122.119 seconds
I have checked the web and people had similar issues as I do. I checked:
Hive Table returning empty result set on all queries
but failed to get my issue resolved. The data is in fact in the HDFS and as mentioned before it works on simple queries.
So either there is something wrong with my Hive instance or my queries are not written correctly.
Any help would be greatly appreciated. Best regards.
Upvotes: 2
Views: 5690
Reputation: 2113
If we have Cuisine table containing ID = {1,2,3} and Recipe table containing ID = {5,6,7} , then even though these tables are non-empty, we still get no rows returned when we do INNER JOIN Cuisine.ID = Recipe.ID (as the ID's are different in the 2 tables) Can you please check that there is no such condition.
SELECT count(1)
FROM cuisine c JOIN recipe ON recipe.cuisine = c.name WHERE c.name = "Pakistan";
--- must return > 0
select count(1) from recipe as recipe
JOIN part_of ON part_of.id_recipe = recipe.id ;
--- must return > 0
select count(1) from part_of as part_of
JOIN ingredient ON ingredient.name = part_of.ingredient ;
--- must return > 0
So the inner query returns a row when all the count(*) are non-zero. Now test the outer select :
SELECT ingredient.id, ingredient.name
FROM cuisine JOIN recipe ON recipe.cuisine = cuisine.name JOIN part_of ON part_of.id_recipe = recipe.id JOIN ingredient ON ingredient.name = part_of.ingredient
WHERE ingredient.id = <inner query result> and cuisine.name = "Bangladesh";
Upvotes: 1
Reputation: 2113
Are you sure that the resulting join will be non-empty. Since, you have implemented inner joins, even if one table has missing records, the entire result set is 0. Try adding a left join with "IS NULL" to validate that all the tables contribute to the result set. If all the child tables have non-null values in their respective columns post-join, then the query is good.
Upvotes: 1