Reputation: 727
After I execute a Hive query via the CLI like below:
$ hive -e QUERY > output.txt
The flow looks like below graph:
==============
Hadoop Cluster
==============
| |
| |
| 2. output RESULT as a single .gz file at HDFS because of 1 reducer
| |
| |
1. QUERY |
| |
| 3. Hive retrieves the RESULT as stream or a whole file ?
| If as a whole file, what happens when file size > memory size ?
| |
| |
===========
Hive Client
===========
|
|
4. Client outputs RESULT to stdout which is redirected to a file
|
|
===========
Output File
===========
My question is: if the single result file on HDFS is super big, even bigger than my local physical memory, how does the Hive client handle it?
Does the Hive client retrieve the file as a stream, or as a whole file?
Upvotes: 2
Views: 454
Reputation: 2924
You are getting the results as a stream, so as long as you don't redirect the output, no temporary file is involved on your machine. You can think of it as doing hadoop fs -cat /THE/RESULT/FILE/OF/YOUR/HIVE/REQUEST
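One way to convince yourself of the streaming behavior is to attach a consumer that quits early: if the client really streams, the pipeline finishes immediately instead of waiting for the whole result. A minimal simulation, with seq standing in for the Hive result stream so it runs without a cluster:

```shell
# seq stands in for the streamed query result (assumption: no real
# cluster available here). head exits after 5 lines and closes the
# pipe, so only a small window of data is ever held in memory.
seq 1 1000000 | head -n 5
# prints 1 through 5, one per line
```

With a real cluster you could try the same idea with hive -e QUERY | head -n 5.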
If the result is large, you can put it back onto an HDFS location instead:
$ hive -e QUERY | hadoop fs -put - /HDFS/LOCATION
But pay attention to the network here, as it might get saturated.
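The pipe in the command above moves data process-to-process, with no local temporary file staged in between. A sketch of that pattern with generic stand-ins (assumption: seq plays the role of hive -e QUERY, and wc -l plays the role of hadoop fs -put -, since no Hadoop client is available here):

```shell
# The rows flow straight from the producer's stdout into the
# consumer's stdin through the pipe; nothing touches local disk.
seq 1 100000 | wc -l
# prints the number of rows that went through the pipe (100000)
```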
Another alternative is to store the data directly in another Hive table; that way Hive does all the work for you and no results are streamed or copied to your local machine.
Upvotes: 2