Reputation: 103
I am new to Hadoop Hive and I am developing a reporting solution. The problem is that the query performance is really slow (hive 0.10, hbase 0.94, hadoop 1.1.1). One of the queries is:
select a.*, b.country, b.city from p_country_town_hotel b
inner join p_hotel_rev_agg_period a on
(a.key.hotel = b.hotel) where b.hotel = 'AdriaPraha' and a.min_date < '20130701'
order by a.min_date desc
limit 10;
which takes quite a long time (50s). I know I know, the join is on string field and not on integer but the data sets are not big(cca 3300 and 100000 records). I tried hints on this SQL but that didn't turn out any faster. The same query on MS SQL Server lasts 1s. Also a simple count(*) from table lasts 7-8s which is shocking (the table has 3300 records). I really don't know what is the issue? Any ideas or did I misinterpret Hadoop?
Upvotes: 9
Views: 23693
Reputation: 34184
Yes..you have misinterpreted Hadoop. Hadoop, and Hive as well, are not meant for real time stuff. They are most suitable for offline, batch processing kinda stuff. They are not at all a replacement to RDBMSs. Though you can do some fine tuning but 'absolute real time' is not possible. There a lot of things which happen under the hood when you run a hive query, which I think you are not unaware of. First of all you Hive query gets converted into a corresponding MR job followed by few other things like split creation, records generation, mapper generation etc. I would never suggest Hadoop(or Hive) if you have real time needs.
You might wanna have a look at Impala for your real time needs.
Upvotes: 17
Reputation: 1146
Hive is not the appropriate tool for a real-time job, but if you want to leverage the Hadoop infrastructure with real-time or fast data access take a look at HBase
. It's value-add is all about fast access. Not sure why you are selecting Hadoop for your solution, but Hbase sits on top of HDFS which some people like because of the inherent redundancy HDFS offers (you copy a file on there once and it is auto-replicated) which may be one of the reasons you are looking into Hadoop.
For more info: read this question
Upvotes: 3
Reputation: 995
I am not sure how new you are to hadoop.Hive does not give you results at interactive speeds how small the tables are.In case you knew this already and trying to tune the query, you can try below:
select a.*, b.country, b.city from
(select * from p_country_town_hotel where hotel= 'AdriaPraha') b
inner join
(select * from p_hotel_rev_agg_period where min_date < '20130701') a
on
a.key.hotel = b.hotel
order by a.min_date desc
limit 10;
If you know one of the tables is small enough to fit in memory, you can try map side join.
Upvotes: 1