Reputation: 11

Impala query returns data in random order

I would like a `SELECT *` query on a table to return rows in the same order in which they are stored in the DB. However, Impala returns the data in a random order. When I execute the same query in Hive, I get the dataset in the correct order. Is there a way to make Impala return the result set in the same order as it is stored in the DB?

Upvotes: 0

Answers (1)

leftjoin

Reputation: 38325

Without ORDER BY, the order of rows returned by a query is not defined. Because execution is parallel and distributed, the order may vary from run to run: some processes finish faster, others wait in the queue, and all of them emit data independently of each other.
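To illustrate the point above, here is a minimal sketch in plain Python that imitates several workers scanning parts of a table in parallel. It is not Impala code: `scan_fragment`, `run_query`, and the sample data are hypothetical names made up for this illustration, and the random sleep merely stands in for scheduling and queueing delays.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_fragment(rows):
    # Hypothetical worker: the random delay stands in for queueing and I/O.
    time.sleep(random.uniform(0, 0.05))
    return rows


def run_query(fragments):
    """Collect rows in whatever order the workers happen to finish."""
    result = []
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        futures = [pool.submit(scan_fragment, rows) for rows in fragments]
        for future in as_completed(futures):
            result.extend(future.result())
    return result


# Three "data files", each internally sorted.
fragments = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

unordered = run_query(fragments)
print(unordered)          # order of the three groups may differ between runs
print(sorted(unordered))  # an explicit sort restores a deterministic order
```

Running it several times will usually print the groups in different orders, while the sorted output stays the same. That is the role ORDER BY plays in the real query.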

Also, according to classic Codd relational theory, the order of rows in a table and the order of columns are immaterial to the database. You can sort data when inserting it into the table: sorted data compresses much better, and internal indexes and bloom filters work better, but the order of rows in the returned dataset is still not guaranteed without ORDER BY. The same applies to Hive. In some cases, when a single mapper runs and there are no reducers, the data will be returned in the same order as it is in the data file, but do not rely on it. Add ORDER BY if you need ordering.

Only single-threaded processing can return data in the same order as it is stored, but this will kill performance. It is better to redesign your data flow and add an ordering column, so that rows can be ordered during SELECT in a distributed environment.
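A minimal sketch of the ordering-column idea, assuming a hypothetical table `events` with a hypothetical `seq` column that records the intended row order. SQLite (through Python's standard `sqlite3` module) is used here only as a stand-in so the example can run anywhere; the `SELECT ... ORDER BY seq` statement is standard SQL and would look the same in Impala or Hive, though the table definition there would differ.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the real engine.
conn = sqlite3.connect(":memory:")

# Hypothetical table: "seq" is the ordering column added during the data flow.
conn.execute("CREATE TABLE events (seq INTEGER, payload TEXT)")

# Rows are deliberately inserted out of order.
rows = [(3, "c"), (1, "a"), (2, "b")]
conn.executemany("INSERT INTO events (seq, payload) VALUES (?, ?)", rows)

# The explicit ORDER BY is what guarantees the order of the result set.
ordered = conn.execute(
    "SELECT seq, payload FROM events ORDER BY seq"
).fetchall()

print(ordered)
conn.close()
```

The ordering column can be anything that captures the order you care about, for example a load timestamp or a line number assigned when the data is ingested.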

Upvotes: 1
