Reputation: 252
I am running hive 0.12, and I'd like to run several queries and get the result back as a python array.
For example:

    result = []
    for col in columns:
        sql = 'select {c} as cat, count(*) as cnt from {t} group by {c} having cnt > 100;'.format(t=table, c=col)
        result.append(hive.query(sql))
    result = dict(result)
What I'm missing is the hive class to run SQL queries. How can this be done?
Upvotes: 2
Views: 5662
Reputation: 1411
You could also access Hive using Thrift. https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python. It looks like pyhs2 is mostly a wrapper around using Thrift directly.
Upvotes: 1
Reputation: 5236
One alternative is to use the pyhs2 library to open a connection to Hive natively from within a Python process. The following is some sample code I had cobbled together to test a different use case, but it should hopefully illustrate use of this library.
    # Python 2.7
    import pyhs2
    from pyhs2.error import Pyhs2Exception

    hql = "SELECT * FROM my_table"
    with pyhs2.connect(
        host='localhost', port=10000, authMechanism="PLAIN",
        user="root", database="default"
        # Use your own credentials and connection info here, of course
    ) as db:
        with db.cursor() as cursor:
            try:
                print "Trying default database"
                cursor.execute(hql)
                for row in cursor.fetch():
                    print row
            except Pyhs2Exception as error:
                print str(error)
Depending on what is or is not already installed on your box, you may also need to install the development headers for both libpython and libsasl2.
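Coming back to the loop in the question: with any working query function, the per-column results can be collected into a dict keyed by column name. The `fake_query` function below is a hypothetical stand-in for a real Hive call, just to show the shape of the result:

    def fake_query(sql):
        # Hypothetical stand-in for a real Hive query;
        # returns canned (category, count) rows.
        return [('red', 120), ('blue', 345)]

    columns = ['color']
    table = 'my_table'

    result = {}
    for col in columns:
        sql = ('select {c} as cat, count(*) as cnt from {t} '
               'group by {c} having cnt > 100').format(t=table, c=col)
        result[col] = fake_query(sql)

    print(result)  # {'color': [('red', 120), ('blue', 345)]}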
Upvotes: 1
Reputation: 13682
One quick and dirty way to do this is to drive the hive command-line tool:

    hive -e "sql command"

Something like this should work:
    import subprocess

    def query(self, cmd):
        """Run a Hive expression and return the result as a list of rows."""
        cmd = 'hive -e "' + cmd + '"'
        prc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, shell=True)
        stdout, stderr = prc.communicate()
        ret = stdout.split('\n')
        ret = [r for r in ret if len(r)]
        if len(ret) == 0:
            return []
        if '\t' in ret[0]:
            # Multi-column output: split each row on tabs
            return [[t.strip() for t in r.split('\t')] for r in ret]
        return ret
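The splitting step can be exercised without a Hive install by running it on a canned output string (the sample rows here are made up). This is just the parsing logic from the function above, pulled out as a standalone helper:

    def parse_hive_output(stdout):
        """Split hive -e output into rows, and rows into columns on tabs."""
        rows = [r for r in stdout.split('\n') if r]
        if not rows:
            return []
        if '\t' in rows[0]:
            return [[t.strip() for t in r.split('\t')] for r in rows]
        return rows

    sample = "red\t120\nblue\t345\n"
    print(parse_hive_output(sample))  # [['red', '120'], ['blue', '345']]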
Upvotes: 2