Reputation: 944
I have some Parquet files stored in HDFS that I want to convert to CSV files first and then export to a remote machine over SSH.
I don't know if it's possible or simple to do with a Spark job (I know that we can convert Parquet to CSV just by reading with spark.read.parquet and writing the same DataFrame back out as CSV, roughly as in the sketch below), but I would really like to do it with an impala-shell request.
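For reference, the Spark version I have in mind is roughly this (paths are made up; coalesce(1) is just so I get a single part file in the output directory):
$ spark-shell <<'EOF'
spark.read.parquet("hdfs:///path/to/my-file.parquet").coalesce(1).write.option("header", "true").csv("hdfs:///tmp/my-file-csv")
EOF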
So I thought about something like this:
hdfs dfs -cat my-file.parquet | ssh myserver.com 'cat > /path/to/my-file.csv'
Can you please help me with this? Thank you!
Upvotes: 3
Views: 1462
Reputation: 2828
You can do that in multiple ways. One approach could be the example below.
With impala-shell you can run a query and pipe it to ssh to write the output on a remote machine.
$ impala-shell --quiet --delimited --print_header --output_delimiter=',' -q 'USE fun; SELECT * FROM games' | ssh [email protected] "cat > /home/..../query.csv"
This command switches from the default database to the fun database and runs the query there.
You can change the delimiter (e.g. --output_delimiter='\t'), include --print_header or leave it out, and adjust other options as needed.
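For example, a tab-separated export without the header row (same hypothetical fun.games table as above) would look like:
$ impala-shell --quiet --delimited --output_delimiter='\t' -q 'USE fun; SELECT * FROM games' | ssh [email protected] "cat > /home/..../query.tsv"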
Upvotes: 0
Reputation: 18023
Example without Kerberos:
impala-shell -i servername:portname -B -q 'select * from table' -o filename '--output_delimiter=\001'
I could explain it all, but it is late; here is a link that covers this, including how to add the header if you want: http://beginnershadoop.com/2019/10/02/impala-export-to-csv/
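For example, adding the header and a comma delimiter gives you a CSV directly (the server name, port, table and filename here are placeholders):
impala-shell -i servername:portname -B -q 'select * from table' --print_header --output_delimiter=',' -o filename.csv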
Upvotes: 1