Reputation: 13
I have been struggling to run Hive queries from a HiveOperator task. Hive and Airflow are installed in Docker containers, and I can query Hive tables successfully from Python code inside the Airflow container and also via the Hive CLI. But when I run the Airflow DAG, I get an error saying the hive/beeline executable cannot be found.
DAG:
import airflow.utils.dates
from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator

dag_hive = DAG(dag_id="hive_script",
               schedule_interval='* * * * *',
               start_date=airflow.utils.dates.days_ago(1))

hql_query = """
CREATE TABLE IF NOT EXISTS mydb.test_af(
  `test` int);
insert into mydb.test_af values (1);
"""

hive_task = HiveOperator(hql=hql_query,
                         task_id="hive_script_task",
                         hive_cli_conn_id="hive_local",
                         dag=dag_hive)

if __name__ == '__main__':
    dag_hive.cli()
Log:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/apache/hive/operators/hive.py", line 156, in execute
self.hook.run_cli(hql=self.hql, schema=self.schema, hive_conf=self.hiveconfs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 249, in run_cli
hive_cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=tmp_dir, close_fds=True
File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'beeline': 'beeline'
[2021-08-19 12:22:04,291] {taskinstance.py:1551} INFO - Marking task as FAILED. dag_id=***_script, task_id=***_script_task, execution_date=20210819T122100, start_date=20210819T122204, end_date=20210819T122204
[2021-08-19 12:22:04,323] {local_task_job.py:149} INFO - Task exited with return code 1
It would be great if someone could help me. Thanks in advance.
Upvotes: 1
Views: 1520
Reputation: 1036
This is my Dockerfile, based on the responses here and elsewhere. You need to download the Hadoop and Hive distributions, unpack them, and update the Dockerfile below with the correct versions. Download the "bin.tar.gz" files, not the "src.tar.gz" files, from:
https://hadoop.apache.org/releases.html
https://hive.apache.org/general/downloads/
Unpack them with tar -xvzf <filename>.
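For example, with the versions used in the Dockerfile below (the exact download URLs are assumptions; older releases may only be available from archive.apache.org):

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
tar -xvzf apache-hive-3.1.3-bin.tar.gz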
Once you build the image and spin up Airflow, you should be able to connect to your Hive instance. I had to add this in the "Extra" box of the connection, but that will depend on your Hive installation: { "auth_mechanism": "CUSTOM" }
FROM apache/airflow:2.6.2
# Install OpenJDK 11
USER root
RUN apt-get update && \
    apt-get install -y openjdk-11-jdk && \
    apt-get install -y ant && \
    apt-get clean
# These packages are needed so the pip install of the Hive provider below can build its SASL dependencies
RUN apt-get install -y --no-install-recommends g++
RUN apt-get install -y --no-install-recommends libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit
USER airflow
# Setup JAVA_HOME -- useful for docker commandline
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
#hadoop
COPY hadoop-3.3.6 /hadoop-3.3.6
ENV HADOOP_HOME=/hadoop-3.3.6
#hive
COPY apache-hive-3.1.3-bin /apache-hive-3.1.3-bin
#upgrade pip as we were getting errors running pip install
RUN python -m pip install --upgrade pip --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org
RUN pip install --no-cache-dir --progress-bar off "apache-airflow==${AIRFLOW_VERSION}" apache-airflow-providers-apache-hive
#run this to confirm files we copied exist
RUN ls -l /
#run this to confirm env vars set correctly
RUN export
#confirm java works
RUN java -XshowSettings:properties -version 2>&1
#confirm beeline works
RUN /apache-hive-3.1.3-bin/bin/beeline --version 2>&1
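For reference, a connection like the hive_local one from the question can also be created from the Airflow CLI instead of the UI. This is only a sketch: the host and port are placeholders for your own Hive setup, and the extra field mirrors the auth_mechanism setting mentioned above:

airflow connections add hive_local \
    --conn-type hive_cli \
    --conn-host my-hive-server \
    --conn-port 10000 \
    --conn-extra '{"auth_mechanism": "CUSTOM"}'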
Upvotes: 0
Reputation: 20097
You need to install beeline in the Apache Airflow image. It depends on which Airflow image you are using, but Airflow's reference image contains only the most common providers, and Hive is not among them. You should extend or customise the image so that beeline is available on the PATH inside the Airflow image.
You can read more about extending/customising Airflow image at https://airflow.apache.org/docs/docker-stack/build.html#adding-new-apt-package
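As a minimal sketch of that approach (the base image tag, paths, versions and JDK package below are assumptions; adapt them to the Hive/Hadoop distribution you actually copy in):

FROM apache/airflow:2.6.2

USER root
# beeline needs a JVM
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-11-jdk && apt-get clean
USER airflow

# Hive provider so HiveOperator/HiveCliHook are importable
RUN pip install --no-cache-dir apache-airflow-providers-apache-hive

# Copy the unpacked Hive (and Hadoop) client distributions and put beeline on PATH
COPY apache-hive-3.1.3-bin /opt/hive
COPY hadoop-3.3.6 /opt/hadoop
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME=/opt/hadoop
ENV HIVE_HOME=/opt/hive
ENV PATH="${PATH}:/opt/hive/bin"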
Upvotes: 2