Reputation: 727
After I execute a Hive query via the CLI like below:
$ hive -e QUERY > output.txt
The flow looks like below graph:
==============
Hadoop Cluster
==============
| |
| |
| 2. output RESULT as a single .gz file at HDFS because of 1 reducer
| |
| |
1. QUERY |
| |
| 3. Hive retrieves the RESULT as stream or a whole file ?
| If as a whole file, what happens when file size > memory size ?
| |
| |
===========
Hive Client
===========
|
|
4. Client outputs RESULT to stdout which is redirected to a file
|
|
===========
Output File
===========
My question is: if the single result file on HDFS is super big, even bigger than my local physical memory, how does the Hive client handle it?
Does the Hive client retrieve the file as a stream, or as a whole file?
Upvotes: 2
Views: 454
Reputation: 2924
You are getting the results as a stream, so as long as you don't redirect the output, no temporary file is involved on your machine. You can think of it as doing hadoop fs -cat /THE/RESULT/FILE/OF/YOUR/HIVE/REQUEST
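One way to convince yourself of the streaming behavior is to attach a consumer that quits early: if the client really streams, the pipeline finishes immediately instead of waiting for the whole result. A minimal simulation, with seq standing in for the Hive result stream so it runs without a cluster:

```shell
# seq stands in for the streamed query result (assumption: no real
# cluster available here). head exits after 5 lines and closes the
# pipe, so only a small window of data is ever held in memory.
seq 1 1000000 | head -n 5
# prints 1 through 5, one per line
```

With a real cluster you could try the same idea with hive -e QUERY | head -n 5.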
If the result is large, you can put it back onto an HDFS location instead:
$ hive -e QUERY | hadoop fs -put - /HDFS/LOCATION
But pay attention to the network here, as it might get saturated.
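The pipe in the command above moves data process-to-process, with no local temporary file staged in between. A sketch of that pattern with generic stand-ins (assumption: seq plays the role of hive -e QUERY, and wc -l plays the role of hadoop fs -put -, since no Hadoop client is available here):

```shell
# The rows flow straight from the producer's stdout into the
# consumer's stdin through the pipe; nothing touches local disk.
seq 1 100000 | wc -l
# prints the number of rows that went through the pipe (100000)
```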
Another alternative is to store the data directly in another Hive table; that way Hive does all the work for you and no results are streamed or copied to your local machine.
Upvotes: 2