Pratik Joshi

Reputation: 258

Apache Drill query data retrieval is not consistent on HDFS

I am working on Apache Drill and HDFS in my project.

I am dealing with a very large file (e.g. 150 GB) that is stored in HDFS. I write my Drill queries so that each one fetches a small batch of rows (e.g. 100 rows), which I process before firing the next query on the same file, in the hope of improving performance (e.g. SELECT * FROM dfs.file path LIMIT 100).

But every time I run a query on that file in HDFS, I do not get consistent data. The rows returned change on every run, as Hadoop may fetch the data from any node in the cluster.

Because of that, while fetching all the records batch by batch, I may get records that I have already processed.

Upvotes: 0

Views: 90

Answers (1)

tobi6

Reputation: 8249

You might have some luck using pagination with LIMIT and OFFSET, although I am not sure about its behaviour with HDFS.

There is a question with a similar approach, "How to use apache drill do page search", and the documentation says:

The OFFSET clause provides a way to skip a specified number of first rows in a result set before starting to return any rows.

(Source)
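As a rough sketch of that pagination idea, the Python snippet below only builds the query strings for each page; it does not talk to Drill. The table path `/data/big_file.csv`, the sort column `record_id`, and the helper `paged_queries` are hypothetical names chosen for illustration. The ORDER BY is my own assumption, not something taken from the quoted documentation: in SQL generally, LIMIT and OFFSET only give repeatable pages when the row order is fixed, so without a stable sort key you may still see repeated rows.

```python
from typing import Iterator


def paged_queries(table: str, order_by: str, page_size: int, total_rows: int) -> Iterator[str]:
    """Yield one LIMIT/OFFSET query per page, each with the same ORDER BY.

    Hypothetical helper: it only formats SQL text and never contacts Drill.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Step through the result set one page at a time.
    for offset in range(0, total_rows, page_size):
        yield (
            f"SELECT * FROM {table} "
            f"ORDER BY {order_by} "
            f"LIMIT {page_size} OFFSET {offset}"
        )


# Hypothetical table path and sort column, for illustration only.
queries = list(paged_queries("dfs.`/data/big_file.csv`", "record_id", 100, 250))
for query in queries:
    print(query)
```

Be aware that sorting a 150 GB file for every page is likely to be expensive, so this may well cost more than it saves; I have not measured it on HDFS.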

Upvotes: 1
