Reputation: 4352
I am trying to write a spark job with Python that would open a jdbc connection with Impala and load a VIEW directly from Impala into a Dataframe. This question is pretty close but in scala: Calling JDBC to impala/hive from within a spark job and creating a table
How do I do this? There are plenty of examples for other datasources such as MySQL, PostgreSQL, etc. but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!
Tried this with information from the web but it didn't work.
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
# --num-executors 10 \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
--driver-class-path $JDBC_PATH/*.jar
properties = {
"driver": "com.cloudera.impala.jdbc41.Driver",
"AuthMech": "1",
# "KrbRealm": "EXAMPLE.COM",
# "KrbHostFQDN": "impala.example.com",
"KrbServiceName": "impala"
}
# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"
db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
I received this error msg (Full Error Log):
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
Upvotes: 5
Views: 20116
Reputation: 11
Did this in Azure Databricks notebook after setting up the jar in the cluster libraries. Generally followed previous post except that d is upper case for Driver config. Worked great.
properties = {
"Driver": "com.cloudera.impala.jdbc41.Driver"
}
db_df = spark.read.jdbc(url= 'jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table ='product', properties = properties)
db_df.show()
Upvotes: 1
Reputation: 401
1st approach is to use spark-submit on below impala_jdbc_connection.py script like spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --class com.cloudera.impala.jdbc41.Driver impala_jdbc_connection.py
impala_jdbc_connection.py
properties = {
"drivers": "com.cloudera.impala.jdbc41.Driver"
}
#initalize the spark session
spark = (
SparkSession.builder
.config("spark.jars.packages", "jar-packages-list")
.config("spark.sql.warehouse.dir","hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
)
db_df = spark.read.jdbc(url= 'jdbc:impala://host_ip_address:21050/database_name', table ='table_name', properties = properties)
db_df.show()
2nd approach is not a direct import from impala to spark but rather a conversion of results to spark dataframe
pip install impyla
Source: https://github.com/cloudera/impyla
Connect to impala and fetch results from impala database and convert result to spark dataframe
from impala.dbapi import connect
conn = connect(host = 'IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res= cursor.fetchall() # convert res to spark dataframe
for data in res:
print(data)
Upvotes: 2
Reputation: 21
This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)
Upvotes: 0
Reputation: 29155
You can use
--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
instead of
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
or for another approach please see my answer
Upvotes: 2