Reputation: 2450
we would like to put the results of a Hive query to a CSV file. I thought the command should look like this:
insert overwrite directory '/home/output.csv' select books from table;
When I run it, it says it completeld successfully but I can never find the file. How do I find this file or should I be extracting the data in a different way?
Upvotes: 85
Views: 234055
Reputation: 5940
Although it is possible to use INSERT OVERWRITE
to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE
does, then I'll describe the method I use to get tsv files from Hive tables.
According to the manual, your query will store the data in a directory in HDFS. The format will not be csv.
Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
A slight modification (adding the LOCAL
keyword) will store the data in a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table;
When I run a similar query, here's what the output looks like.
[lvermeer@hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug 9 09:21 000000_0
[lvermeer@hadoop temp]$ head 000000_0
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE
Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select books from table' > /home/lvermeer/temp.tsv
That gives me a tab-separated file that I can use. Hope that is useful for you as well.
Based on this patch-3682, I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
Upvotes: 148
Reputation: 1
Use the command:
hive -e "use [database_name]; select * from [table_name] LIMIT 10;" > /path/to/file/my_file_name.csv
I had a huge dataset whose details I was trying to organize and determine the types of attacks and the numbers of each type. An example that I used on my practice that worked (and had a little more details) goes something like this:
hive -e "use DataAnalysis;
select attack_cat,
case when attack_cat == 'Backdoor' then 'Backdoors'
when length(attack_cat) == 0 then 'Normal'
when attack_cat == 'Backdoors' then 'Backdoors'
when attack_cat == 'Fuzzers' then 'Fuzzers'
when attack_cat == 'Generic' then 'Generic'
when attack_cat == 'Reconnaissance' then 'Reconnaissance'
when attack_cat == 'Shellcode' then 'Shellcode'
when attack_cat == 'Worms' then 'Worms'
when attack_cat == 'Analysis' then 'Analysis'
when attack_cat == 'DoS' then 'DoS'
when attack_cat == 'Exploits' then 'Exploits'
when trim(attack_cat) == 'Fuzzers' then 'Fuzzers'
when trim(attack_cat) == 'Shellcode' then 'Shellcode'
when trim(attack_cat) == 'Reconnaissance' then 'Reconnaissance' end,
count(*) from actualattacks group by attack_cat;">/root/data/output/results2.csv
Upvotes: 0
Reputation: 91
This is most csv friendly way I found to output the results of HiveQL.
You don't need any grep or sed commands to format the data, instead hive supports it, just need to add extra tag of outputformat.
hive --outputformat=csv2 -e 'select * from <table_name> limit 20' > /path/toStore/data/results.csv
Upvotes: 7
Reputation: 1
This shell command prints the output format in csv to output.txt
without the column headers.
$ hive --outputformat=csv2 -f 'hivedatascript.hql' --hiveconf hive.cli.print.header=false > output.txt
Upvotes: 0
Reputation: 565
I may be late to this one, but would help with the answer:
echo "COL_NAME1|COL_NAME2|COL_NAME3|COL_NAME4" > SAMPLE_Data.csv hive -e ' select distinct concat(COL_1, "|", COL_2, "|", COL_3, "|", COL_4) from table_Name where clause if required;' >> SAMPLE_Data.csv
Upvotes: 0
Reputation: 860
hive --outputformat=csv2 -e "select * from yourtable" > my_file.csv
or
hive --outputformat=csv2 -e "select * from yourtable" > [your_path]/file_name.csv
For tsv, just change csv to tsv in the above queries and run your queries
Upvotes: 2
Reputation: 483
Just to cover more following steps after kicking off the query:
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
In my case, the generated data under temp folder is in deflate
format,
and it looks like this:
$ ls
000000_0.deflate
000001_0.deflate
000002_0.deflate
000003_0.deflate
000004_0.deflate
000005_0.deflate
000006_0.deflate
000007_0.deflate
Here's the command to unzip the deflate files and put everything into one csv file:
hadoop fs -text "file:///home/lvermeer/temp/*" > /home/lvermeer/result.csv
Upvotes: 0
Reputation: 2485
I tried various options, but this would be one of the simplest solution for Python
Pandas
:
hive -e 'select books from table' | grep "|" ' > temp.csv
df=pd.read_csv("temp.csv",sep='|')
You can also use tr "|" ","
to convert "|" to ","
Upvotes: 1
Reputation: 2375
In case you are doing it from Windows you can use Python script hivehoney to extract table data to local CSV file.
It will:
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd
python hh.py --query_file=query.sql
Upvotes: 0
Reputation: 3324
Similar to Ray's answer above, Hive View 2.0 in Hortonworks Data Platform also allows you to run a Hive query and then save the output as csv.
Upvotes: 0
Reputation: 63
I had a similar issue and this is how I was able to address it.
Step 1 - Loaded the data from Hive table into another table as follows
DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT Column List FROM TestHiveTable;
Step 2 - Copied the blob from Hive warehouse to the new location with appropriate extension
Start-AzureStorageBlobCopy
-DestContext $destContext
-SrcContainer "Source Container"
-SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0"
-DestContainer "Destination Container"
-DestBlob "CSV/TestHiveTable.csv"
Upvotes: 2
Reputation: 166
The default separator is "^A
". In python language, it is "\x01
".
When I want to change the delimiter, I use SQL like:
SELECT col1, delimiter, col2, delimiter, col3, ..., FROM table
Then, regard delimiter+"^A
" as a new delimiter.
Upvotes: 1
Reputation: 81
You can use INSERT
… DIRECTORY
…, as in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
OVERWRITE
and LOCAL
have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees
, depending on the number of reducers invoked.
Upvotes: 4
Reputation: 29155
You can use hive string function CONCAT_WS( string delimiter, string str1, string str2...strn )
for ex:
hive -e 'select CONCAT_WS(',',cola,colb,colc...,coln) from Mytable' > /home/user/Mycsv.csv
Upvotes: 2
Reputation: 4287
I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace (space, newline, tab) chars and commas.
To make the column data tsv safe, I replaced all \t chars in the column data with a space, and executed python code on the commandline to generate a csv file, as shown below:
hive -e 'tab_replaced_hql_query' | python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'
This created a perfectly valid csv. Hope this helps those who come looking for this solution.
Upvotes: 3
Reputation: 150
If you are using HUE this is fairly simple as well. Simply go to the Hive editor in HUE, execute your hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.
Upvotes: 3
Reputation: 1076
If you want a CSV file then you can modify Lukas' solutions as follows (assuming you are on a linux box):
hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv
Upvotes: 25
Reputation: 6289
You should use CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.
You also might have to do some trickery to convert the files from '\001' - delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.
Upvotes: 4