NiFi: Cannot import pyspark in ExecuteScript processor

Question

I need to use the ExecuteScript processor in NiFi to transpose columns, and I am using pyspark to do that.

But the processor reports this error: "failed to process due to javax.script.ScriptException: ImportError: No module named pyspark in at line number 1:"

I set the paths to spark and pyspark in the Module Directory property of ExecuteScript like this:

C:\Users\username\Desktop\spark\spark-2.4.3-bin-hadoop2.7\hadoop,
C:\Users\username\Desktop\spark\spark-2.4.3-bin-hadoop2.7\bin\pyspark

But it did not work.

I am afraid this is a very basic issue, but I have not been able to figure it out after half a day.

Andy · Accepted Answer

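This is likely because the pyspark module is a natively-compiled Python module, and Apache NiFi uses Jython in the ExecuteScript processor. Jython runs on the JVM and cannot load native CPython modules, so adding paths to the Module Directory property does not help. This is a known issue, and the full explanation is here, as well as some workarounds and details on options.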

The simplest answer is to use ExecuteStreamCommand and pass the necessary flowfile attributes as arguments, and the content as STDIN. The output of the Python script will be returned via STDOUT and captured as the new flowfile content.
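
As a rough illustration of that approach, here is a minimal sketch of a script that ExecuteStreamCommand could call. It assumes the flowfile content is delimited text (CSV), and it uses only the standard library so it stays self-contained; the function name transpose and the idea of passing the delimiter as the first argument are my own choices for the example, not anything NiFi requires. Because the script runs under a regular CPython interpreter rather than Jython, it could import pyspark instead, provided pyspark is installed for that interpreter.

```python
import csv
import sys


def transpose(in_stream, out_stream, delimiter=","):
    """Read delimited rows from in_stream and write them transposed to out_stream."""
    rows = list(csv.reader(in_stream, delimiter=delimiter))
    if not rows:
        return  # empty flowfile content: nothing to write
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    writer = csv.writer(out_stream, delimiter=delimiter, lineterminator="\n")
    # zip(*rows) turns columns into rows.
    writer.writerows(zip(*padded))


if __name__ == "__main__":
    # Assumption: the delimiter arrives as the first command argument;
    # the flowfile content arrives on STDIN and the result goes to STDOUT.
    delimiter = sys.argv[1] if len(sys.argv) > 1 else ","
    transpose(sys.stdin, sys.stdout, delimiter)
```

In ExecuteStreamCommand, Command Path would point at the Python executable, and Command Arguments would hold the script path followed by any values taken from flowfile attributes.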
