Increasing querying performance for huge dataset

Question

I am new to Hive and SQL. I am currently extracting rows from a table with a query like this:

SELECT * FROM database.table WHERE A = '980dsf9sfjklsdfj' AND B = '141519384938' AND C = 'URL'

A --> some id value
B --> timestamp value
C --> url

These queries take a while to run, and I imagine they will take even longer as more data is added to the table. How can I speed them up? I thought that sorting on the timestamp value beforehand might make the queries faster.

GolezTrol · Accepted Answer

I'm new to Hive too, but in general you can speed up queries like this by adding indexes. You can index a single field, but often you can also create a combined index on multiple fields, which improves performance further when you filter on a combination of those fields.

As you say, 'sort the timestamp value beforehand': that is basically what an index does. You can create an index like so:

CREATE INDEX idx_table
ON TABLE yourtable (A)
AS 'index.handler.class.name'

or a combined index:

CREATE INDEX idx_table2
ON TABLE yourtable (A, B, C)
AS 'index.handler.class.name'
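
As a more concrete sketch of the combined index above: this assumes a Hive release older than 3.0 (index support was removed in Hive 3.0) and uses the built-in 'COMPACT' handler in place of the handler class name. The names idx_abc and yourtable are placeholders, not anything from the question.

```sql
-- Sketch only: assumes Hive < 3.0 and placeholder names idx_abc / yourtable.
-- Create a combined index on the three columns used in the WHERE clause.
CREATE INDEX idx_abc
ON TABLE yourtable (A, B, C)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- A deferred index starts out empty; build it before relying on it.
ALTER INDEX idx_abc ON yourtable REBUILD;

-- Let the optimizer consider indexes when applying filters.
SET hive.optimize.index.filter=true;

-- The original query, now a candidate for using the index.
SELECT * FROM yourtable
WHERE A = '980dsf9sfjklsdfj' AND B = '141519384938' AND C = 'URL';
```

Hive does not keep such an index up to date automatically, so the REBUILD step would need to be repeated after new data is loaded.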

For information about creating indexes in Hive, please read the documentation here:

https://cwiki.apache.org/confluence/display/Hive/IndexDev
