mnm

Reputation: 2022

How to access Hive on remote server using python client

Case: I have Hive on a Cloudera platform. There is a database on Hive that I want to access with a Python client from my computer. I read a similar SO question, but it uses pyhs2, which I am unable to install on the remote server. Another SO question uses Thrift, but I can't seem to install that either.

Code: After following the documentation, I executed the following program, and it gives me an error.

import pyodbc, sys, os

pyodbc.autocommit = True
con = pyodbc.connect("DSN=default", driver='SQLDriverConnect', autocommit=True)
cursor = con.cursor()
cursor.execute("select * from fb_mpsp")

Error: ssh://ashish@ServerIPAddress/home/ashish/anaconda/bin/python2.7 -u /home/ashish/PyCharm_proj/hdfsConnect/home/ashish/PyCharm_proj/hdfsConnect/Hive_connect/hive_connect.py

Traceback (most recent call last): File "/home/ashish/PyCharm_proj/hdfsConnect/home/ashish/PyCharm_proj/hdfsConnect/Hive_connect/hive_connect.py", line 5, in con = pyodbc.connect("DSN=default", driver='SQLDriverConnect',autocommit=True)

pyodbc.Error: ('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found, and no default driver specified (0) (SQLDriverConnect)')

Process finished with exit code 1

Please suggest how I can solve this problem. Also, I am not sure why I have to specify the driver as SQLDriverConnect when the code will be executed using Hadoop Hive.

Thanks

Upvotes: 1

Views: 10759

Answers (3)

shashankS

Reputation: 1073

Try this method as well to connect and fetch data remotely from the Hive server: SSH into the remote machine and run the Hive CLI there, writing the result to a file on the remote host:

ssh -o UserKnownHostsFile=/dev/null -o ConnectTimeout=90 -o StrictHostKeyChecking=no shashanks@remote_host 'hive -e "select * from DB.testtable limit 5;" >/home/shashanks/testfile'
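Since the question asks for a Python client, the same one-shot SSH approach can be scripted with the standard library; here is a minimal sketch (the host, user, query, and file paths are hypothetical examples from the command above, so adjust them for your cluster):

```python
import subprocess

# Hypothetical host, user, and paths -- adjust for your cluster.
remote = "shashanks@remote_host"
query = "select * from DB.testtable limit 5;"

# Build the same ssh command as above; the redirect runs on the remote host.
cmd = [
    "ssh",
    "-o", "UserKnownHostsFile=/dev/null",
    "-o", "ConnectTimeout=90",
    "-o", "StrictHostKeyChecking=no",
    remote,
    'hive -e "{}" >/home/shashanks/testfile'.format(query),
]

# Run it once the host is reachable:
# subprocess.check_call(cmd)
```

Note that this shells out to the Hive CLI rather than speaking a wire protocol, so it avoids installing any driver locally, at the cost of parsing text output yourself.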

Upvotes: 1

ozw1z5rd

Reputation: 3208

This worked for me

oODBC = pyodbc.connect("DSN=Cloudera Hive DSN 64;", autocommit = True, ansi = True )

And now everything works fine.

Make sure everything is fine with your DSN using:

isql -v "Cloudera Hive DSN 64" 

and replace "Cloudera Hive DSN 64" with the name you used in your odbc.ini.
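For reference, a DSN entry in odbc.ini typically looks something like the following; the driver path and host shown here are examples (check the actual library location from your Cloudera ODBC driver install):

```ini
[Cloudera Hive DSN 64]
# Example path -- verify where your Cloudera Hive ODBC driver installed its .so
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
Host=your.hive.server
Port=10000
```

The section name in brackets is exactly what you pass as DSN= in pyodbc and to isql.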

Also, I currently can't use Kerberos authentication unless I create a ticket by hand; Impala works smoothly with Kerberos keytab files.

Any help on getting Hive ODBC working with keytab files is appreciated.

Upvotes: 2

Tristan Reid

Reputation: 6154

If you do decide to revisit pyhs2, note that it doesn't need to be installed on the remote server; it's installed on your local client.

If you continue with pyodbc, you need to install the ODBC driver for Hive, which you can get from Cloudera's site.

You don't need to specify the driver in your connection, it should be part of your DSN. The specifics of creating the DSN depend on your OS, but essentially you will create it using Administrative Tools -> Data Sources (Windows), install ODBC and edit /Library/ODBC/odbc.ini (Mac), or edit /etc/odbc.ini (Linux).

Conceptually, think of the DSN as a specification that represents all the information about the connection - it will contain the host, port, and driver information. That way in your code you don't have to specify these things and you can switch details about the database without changing your code.

# Note: only the DSN name specifies the connection
import pyodbc

conn = pyodbc.connect("DSN=Hive1")
cursor = conn.cursor()
cursor.execute("select * from YYY")
for row in cursor.fetchall():
    print(row)

Alternatively, I've updated the other question you referenced with information about how to install the thrift libraries. I think that's the way to go, if you have that option.

Upvotes: 1
