Reputation: 258
I am working with Apache Drill and HDFS in my project.
I am dealing with a very big file (e.g. 150 GB) that is stored in HDFS. I write my Drill query so that I fetch a small amount of data (e.g. 100 rows), process it, and then fire another query on the same file, hoping this will improve performance.
(e.g. SELECT * FROM dfs.file path LIMIT 100)
But every time I run a query on that file in HDFS, I do not get consistent data. The result changes each time, as Hadoop may fetch the data from any node in the cluster.
Because of that, while retrieving all the records I may get records that I have already fetched.
Upvotes: 0
Views: 90
Reputation: 8249
You might have some luck using pagination with LIMIT and OFFSET, although I am not sure about its behaviour with HDFS.
There is a question with a similar approach, "How to use apache drill do page search", and the documentation says:
The OFFSET clause provides a way to skip a specified number of first rows in a result set before starting to return any rows.
(Source)
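As a rough sketch of what this could look like, the helper below builds one paged query per call. It is only an illustration: the function name page_query, the path /path/to/file, and the column id are hypothetical placeholders, not taken from the question. It also adds an ORDER BY, on the assumption that the data has a unique column to sort on, because LIMIT and OFFSET without a defined order do not guarantee the same rows on every run. Note that sorting a file of this size may be expensive, so this may not suit every case.

```python
def page_query(table: str, order_by: str, page: int, page_size: int = 100) -> str:
    """Build a SQL query string for one page of results.

    `table` and `order_by` are inserted verbatim, so they must come from
    trusted input. Both are placeholders chosen by the caller.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    # Number of rows to skip before this page starts.
    offset = page * page_size
    # ORDER BY gives the rows a defined order, so consecutive pages
    # should not overlap (assuming the sort column is unique).
    return (
        f"SELECT * FROM {table} "
        f"ORDER BY {order_by} "
        f"LIMIT {page_size} OFFSET {offset}"
    )


# Hypothetical path and column name, for illustration only.
for page in range(3):
    print(page_query("dfs.`/path/to/file`", "id", page))
```

Each generated string can then be submitted to Drill through whichever client you already use; the helper itself only builds the query text.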
Upvotes: 1