Reputation: 1
The PySpark @F.pandas_udf decorator results in an irresolvable dependency error in the Foundry Build Service environment. It uses PyArrow, which is linked against an OpenSSL version that the build system environment does not provide, and even installing it in the user / project environment by adding it to meta.yaml doesn't resolve the problem. stdout:
ImportError: PyArrow >= 4.0.0 must be installed; however, it was not found.
Traceback (most recent call last):
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
import pyarrow
File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/pyarrow/__init__.py", line 65, in <module>
import pyarrow.lib as _lib
ImportError: /app/work-dir/__python_runtime_environment__/__SYMLINKS__/lib-dynload/../../libcrypto.so.3: version `OPENSSL_3.4.0' not found (required by /app/work-dir/__environment__/__SYMLINKS__/site-packages/pyarrow/../../../././libssl.so.3)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms_spark_module/delegate.py", line 100, in _execute_job
result = job.run(
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/_build.py", line 329, in run
self._transform.compute(**kwargs, **parameters)
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/api/_transform.py", line 334, in compute
output_df: Union[DataFrame, Any] = self(**kwargs)
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/api/_transform.py", line 183, in __call__
return self._compute_func(*args, **kwargs)
File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/data_anon.py", line 15, in compute
"case_weight": redist(df, "case_weight"),
File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/utils.py", line 32, in redist
df = df.withColumn(column_name, add_noise(column_name, dist))
File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/utils.py", line 14, in add_noise
@F.pandas_udf(DoubleType())
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/pyspark/sql/pandas/functions.py", line 338, in pandas_udf
require_minimum_pyarrow_version()
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/pyspark/sql/pandas/utils.py", line 60, in require_minimum_pyarrow_version
raise ImportError(
ImportError: PyArrow >= 4.0.0 must be installed; however, it was not found.
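For context, the failure happens at decoration time, before any data is processed: F.pandas_udf calls require_minimum_pyarrow_version() as soon as the decorator is applied. A minimal sketch of the kind of noise function involved (the real utils.py is not shown in the question, so the names and logic here are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the add_noise helper in utils.py.
# Wrapping this with F.pandas_udf(DoubleType()) is what triggers
# require_minimum_pyarrow_version() -- and therefore the import of
# pyarrow, whose native libraries need the matching OpenSSL symbols.
def add_noise(s: pd.Series, scale: float = 1.0) -> pd.Series:
    rng = np.random.default_rng(0)
    return s + rng.normal(0.0, scale, size=len(s))

# Inside a transform this would be registered and applied roughly as:
#   noisy = F.pandas_udf(add_noise, DoubleType())
#   df = df.withColumn("case_weight", noisy("case_weight"))
```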
Upvotes: 0
Views: 47
Reputation: 1
The PySpark error message is misleading and, as you have correctly identified, the failure is caused by a mismatch between the OpenSSL version in the Conda environment and the one in the Foundry build's environment.
Newer versions of Python Transforms ship with OpenSSL 3.4.0, so you can fix this issue by making sure you don't pin the openssl version in your meta.yaml file and by upgrading your repository to the latest template version.
Upvotes: 0