Liondancer

Reputation: 16469

Increasing querying performance for huge dataset

I am new to Hive and SQL. I am currently extracting rows from a table with a query like this:

SELECT * FROM database.table WHERE A = '980dsf9sfjklsdfj' AND B = '141519384938' AND C = 'URL'

A --> some ID value
B --> timestamp value
C --> URL

These queries take a while to run, and I imagine they will take even longer as more data is added to the table. How can I speed this up? Would sorting the data by the timestamp value beforehand make the queries faster?

Upvotes: 0

Views: 56

Answers (2)

Arani

Reputation: 182

Is your table partitioned? If not, I would suggest you create a new partitioned external table (partitioned on the URL) and load the data from the old table into the new one. You will need to use dynamic partitioning here. This should improve performance, because a query that filters on the URL then reads only the matching partition instead of scanning the whole table.

Also, depending on the cardinality of the ID field, you may want to bucket your data on the ID.
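
To make the suggestion above concrete, here is a minimal HiveQL sketch. The table names (events_raw, events_by_url), the column types, the bucket count, and the LOCATION path are all assumptions made for illustration, not details from the question. It also assumes the number of distinct URLs is modest, since partitioning on a column with very many distinct values creates a large number of small partitions.

```sql
-- Allow Hive to derive partition values from the data being inserted.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- On Hive versions before 2.0, bucketing also needs:
-- SET hive.enforce.bucketing = true;

-- Hypothetical target table: partitioned by URL (C), bucketed by ID (A).
-- STRING types are assumed because the question compares against quoted literals.
CREATE EXTERNAL TABLE events_by_url (
  A STRING,
  B STRING
)
PARTITIONED BY (C STRING)
CLUSTERED BY (A) INTO 32 BUCKETS   -- bucket count is an arbitrary example value
LOCATION '/data/events_by_url';    -- hypothetical path

-- Dynamic-partition insert: the partition column must come last in the SELECT.
INSERT OVERWRITE TABLE events_by_url PARTITION (C)
SELECT A, B, C
FROM events_raw;                   -- hypothetical name for the old table

-- The original query, now able to skip every partition except C = 'URL'.
SELECT *
FROM events_by_url
WHERE C = 'URL'
  AND A = '980dsf9sfjklsdfj'
  AND B = '141519384938';
```

Running EXPLAIN on the final SELECT is one way to check that only the expected partition is being read.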

Upvotes: 1

GolezTrol

Reputation: 116100

I'm new to Hive too, but in general, you can speed up queries like this by adding indexes. You can add an index on a single field, but often you can also create a combined index on multiple fields, which helps further when you filter on a combination of those fields.

As you say, sorting the timestamp values beforehand would help, and that is basically what an index does. You can create an index like so:

CREATE INDEX idx_table
ON TABLE yourtable (A)
AS 'index.handler.class.name'

or a combined index:

CREATE INDEX idx_table2
ON TABLE yourtable (A, B, C)
AS 'index.handler.class.name'

For information about creating indexes in Hive, please read the documentation here:

https://cwiki.apache.org/confluence/display/Hive/IndexDev
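
As a sketch of how the template above might be filled in, the following uses Hive's built-in 'COMPACT' index handler; the index name idx_table_abc is made up for illustration, and yourtable is the same placeholder as above. Note that built-in indexes were removed in Hive 3.0, so this only applies to older versions.

```sql
-- Combined index on the three filter columns, using the built-in compact handler.
-- WITH DEFERRED REBUILD creates the index empty; it is populated by the REBUILD below.
CREATE INDEX idx_table_abc
ON TABLE yourtable (A, B, C)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or refresh) the index contents. This must be rerun after new data is loaded.
ALTER INDEX idx_table_abc ON yourtable REBUILD;
```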

Upvotes: 2
