Alejandro Rodriguez
Alejandro Rodriguez

Reputation: 252

Automating Hive with python

I am running hive 0.12, and I'd like to run several queries and get the result back as a python array.

for example:

result=[]
for col in columns:
  sql='select {c} as cat,count(*) as cnt from {t} group by {c} having cnt > 100;'.format(t=table,c=col)
  result.append(hive.query(sql))
result=dict(result)

What I'm missing, is the hive class to run SQL queries.

How can this be done ?

Upvotes: 2

Views: 5662

Answers (3)

brandon.bell
brandon.bell

Reputation: 1411

You could also access Hive using Thrift. https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python. It looks like pyhs2 is mostly a wrapper around using Thrift directly.

Upvotes: 1

rchang
rchang

Reputation: 5236

One alternative is to use the pyhs2 library to open a connection to Hive natively from within a Python process. The following is some sample code I had cobbled together to test a different use case, but it should hopefully illustrate use of this library.

# Python 2.7
import pyhs2
from pyhs2.error import Pyhs2Exception

hql = "SELECT * FROM my_table"
with pyhs2.connect(
  host='localhost', port=10000, authMechanism="PLAIN", user="root" database="default"
  # Use your own credentials and connection info here of course
) as db:
  with db.cursor() as cursor:

    try:
      print "Trying default database"
      cursor.execute(hql)
      for row in cursor.fetch(): print row
    except Pyhs2Exception as error:
      print(str(error))

Depending on what is or is not already installed on your box, you may need to also install the development headers for both libpython and libsasl2.

Upvotes: 1

Uri Goren
Uri Goren

Reputation: 13682

One quick and dirty way to do this, is to automate hive from the command line

hive -e "sql command"

Something like this should work

def query(self,cmd):
    """Run a hive expression"""
    cmd='hive -e "'+cmd+'"';
    prc = subprocess.Popen(cmd, stdout=subprocess.PIPE,stderr=subprocess.PIPE, shell=True)
    ret=stdout.split('\n')
    ret=[r for r in ret if len(r)]
    if (len(ret)==0):
         return []
    if (ret[0].find('\t')>0):
         return [[t.strip() for t in r.split('\t')] for r in ret]
    return ret

Upvotes: 2

Related Questions