Olba12

Reputation: 335

Connect to hive with sqlalchemy odbc.ini file

I'd like to connect to Hive using SQLAlchemy and an odbc.ini file.

I have an odbc.ini file that looks like this:

[Hive]
Description = ODBC Hive connection to Hadoop cluster
Driver = /opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HOST = host1,host2,host3
Schema = default
ServiceDiscoveryMode = 1
ZKNamespace = hiveserver2
HiveServerType = 2
AuthMech = 1
ThriftTransport = 1
UseNativeQuery = 0
KrbHostFQDN = _HOST
KrbServiceName = hive
ServicePrincipalCanonicalization = 1
SSL = 0
TwoWaySSL = 0

I tried to connect with

import sqlalchemy
import pandas as pd

query_test = """SELECT * FROM my_table limit 1;"""

engine = sqlalchemy.create_engine("hive://@Hive")
conn = engine.connect()
print(pd.read_sql(query_test, conn))

which produces the error TTransportException: failed to resolve sockaddr for Hive:10000

in

~/folder/lib64/python3.6/site-packages/thrift/transport/TSocket.py in open(self)
    125             msg = 'failed to resolve sockaddr for ' + str(self._address)
    126             logger.exception(msg)
--> 127             raise TTransportException(type=TTransportException.NOT_OPEN, message=msg,         inner=gai)
    128         for family, socktype, _, _, sockaddr in addrs:
    129             handle = self._do_open(family, socktype)

I've been told that this is because we have multiple hosts and have to go through something called ZooKeeper.

What I can't understand is that when I use pyodbc there is no problem. The following works just fine:

pyodbc.autocommit = True
cnxn = pyodbc.connect(dsn='Hive', autocommit=True)
data = pd.read_sql(query, cnxn, params=params)
cnxn.close()

How should I configure my sqlalchemy code to work with my odbc.ini file?

I have

PyHive-0.6.5
thrift-0.16.0
thrift_sasl-0.4.3
SQLAlchemy-1.4.36
sasl-0.3.1
pyodbc-4.0.32

Upvotes: 0

Views: 1461

Answers (1)

ljmc

Reputation: 5294

You seem to want to use the Hive ODBC Connector from Cloudera to connect to Hive, but then you use a hive:// URI, which means SQLAlchemy is going to try to use pyHive, which is unaware of odbc.ini.

To connect through ODBC from SQLAlchemy you need to use a <dialect>+pyodbc:// URI, such as mssql+pyodbc://, mysql+pyodbc:// or sybase+pyodbc://.
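This also explains your exact error: SQLAlchemy parses the "Hive" in hive://@Hive as an ordinary hostname, not a DSN lookup, and pyHive then falls back to its default port 10000. A quick sketch using SQLAlchemy's URL parser (no dialect is loaded and no connection is made):

```python
from sqlalchemy.engine.url import make_url

# Parse the URL without loading any dialect or opening a connection.
url = make_url("hive://@Hive")

# pyHive treats "Hive" as a hostname; with no port given it defaults
# to 10000, hence "failed to resolve sockaddr for Hive:10000".
print(url.host)  # Hive
print(url.port)  # None (pyHive fills in 10000 later)
```

Only the pyodbc dialects interpret that position as a DSN name to look up in odbc.ini; the hive:// dialect does not.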

There is no equivalent dialect for Hive + ODBC, so even if you managed to open a Hive connection through pyODBC inside SQLAlchemy, it would not know how to speak to Hive. I found the repo AurelienGalicher/PyHiveODBC, which seems to contain a first draft of a Hive + ODBC dialect, but I have not tested it.

The pyHive documentation points to connecting SQLAlchemy directly through a hive:// URI.

So either you need to develop a Hive + ODBC dialect, or use pyHive, which does not use ODBC at all (and therefore ignores your odbc.ini DSN).
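If you go the pyHive route, the connection string bypasses odbc.ini entirely and points at a single HiveServer2 instance. A minimal sketch, assuming pyHive is installed and one of your hosts (here "host1", an assumption) accepts direct connections on the default port 10000; with ZooKeeper service discovery in play, the actually reachable host and port may differ:

```python
from urllib.parse import quote

def pyhive_url(host, port=10000, schema="default", user=""):
    """Build a pyHive SQLAlchemy URL of the form hive://[user@]host:port/schema."""
    auth = f"{quote(user)}@" if user else ""
    return f"hive://{auth}{host}:{port}/{schema}"

url = pyhive_url("host1")  # "hive://host1:10000/default"

# Then, with pyHive installed:
# engine = sqlalchemy.create_engine(url)
# print(pd.read_sql(query_test, engine.connect()))
```

Note this hard-codes one host instead of letting ZooKeeper pick a live HiveServer2, which is the trade-off versus your ODBC DSN.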

Upvotes: 2
